"The diagrams he uses are excellent and his writing style is clear and readable. In sum,

Foreword


Rarely does one find a book on a well-known topic that is both historically and 
technically comprehensive and remarkably accurate. One of the things I admire 
about this work is the "warts and all" approach that gives it such credibility. The
TCP/IP architecture is a product of the time in which it was conceived. That it has 
been able to adapt to growing requirements in many dimensions by factors of a 
million or more, to say nothing of a plethora of applications, is quite remarkable. 
Understanding the scope and limitations of the architecture and its protocols is a 
sound basis from which to think about future evolution and even revolution. 

During the early formulation of the Internet architecture, the notion of
"enterprise" was not really recognized. In consequence, most networks had their own
IP address space and "announced" their addresses in the routing system directly.
After the introduction of commercial service, Internet Service Providers emerged
as intermediaries who "announced" Internet address blocks on behalf of their
customers. Thus, most of the address space was assigned in a "provider dependent"
fashion. "Provider independent" addressing was unusual. The net result (no pun
intended) led to route aggregation and containment of the size of the global
routing table. While this tactic had benefits, it also created the "multi-homing"
problem since users of provider-dependent addresses did not have their own entries
in the global routing table. The IP address "crunch" also led to Network Address
Translation, which also did not solve provider dependence and multi-homing
problems.

Reading through this book evokes a sense of wonder at the complexity that 
has evolved from a set of relatively simple concepts that worked with a small number
of networks and application circumstances. As the chapters unfold, one can
see the level of complexity that has evolved to accommodate an increasing number 
of requirements, dictated in part by new deployment conditions and challenges, to 
say nothing of sheer growth in the scale of the system. 

The issues associated with securing "enterprise" users of the Internet also led 
to firewalls that are intended to supply perimeter security. While useful, it has 
become clear that attacks against local Internet infrastructure can come through
internal compromises (e.g., an infected computer is put onto an internal network
or an infected thumb-drive is used to infect an internal computer through its USB 
port). 

It has become apparent that, in addition to a need to expand the Internet
address space through the introduction of IP version 6, with its 340 trillion
trillion trillion addresses, there is also a strong need to introduce various
security-enhancing mechanisms such as the Domain Name System Security Extension
(DNSSEC) among many others.

What makes this book unique, in my estimation, is the level of detail and
attention to history. It provides background and a sense for the ways in which solutions
to networking problems have evolved. It is relentless in its effort to achieve
precision and to expose remaining problem areas. For an engineer determined to refine
and secure Internet operation or to explore alternative solutions to persistent
problems, the insights provided by this book will be invaluable. The authors deserve
credit for a thorough rendering of the technology of today's Internet.

Woodhurst
June 2011

Vint Cerf



Preface to the Second Edition 


Welcome to the second edition of TCP/IP Illustrated, Volume 1. This book aims 
to provide a detailed, current look at the TCP/IP protocol suite. Instead of just
describing how the protocols operate, we show the protocols in operation using
a variety of analysis tools. This helps you better understand the design decisions
behind the protocols and how they interact with each other, and it simultaneously
exposes you to implementation details without your having to read through the
implementation's software source code or set up an experimental laboratory. Of
course, reading source code or setting up a laboratory will only help to increase
your understanding.

Networking has changed dramatically in the past three decades. Originally a
research project and object of curiosity, the Internet has become a global
communication fabric upon which governments, businesses, and individuals depend. The
TCP/IP suite defines the underlying methods used to exchange information by
every device on the Internet. After more than a decade of delay, the Internet and
TCP/IP itself are now undergoing an evolution, to incorporate IPv6. Throughout
the text we will discuss both IPv6 and the current IPv4 together, but we
highlight the differences where they are important. Unfortunately, they do not directly
interoperate, so some care and attention are required to appreciate the impact of
the evolution.

The book is intended for anyone wishing to better understand the current set
of TCP/IP protocols and how they operate: network operators and administrators,
network software developers, students, and users who deal with TCP/IP. We have
included material that should be of interest to new readers as well as those
familiar with the material from the first edition. We hope you will find the
coverage of the new and older material useful and interesting.

Comments on the First Edition 

Nearly two decades have passed since the publication of the first edition of TCP/IP 
Illustrated, Volume 1. It continues to be a valuable resource for both students and 
professionals in understanding the TCP/IP protocols at a level of detail difficult to
obtain in competing texts. Today it remains among the best references for detailed
information regarding the operation of the TCP/IP protocols. However, even the
best books concerned with information and communications technology become
dated after a time, and the TCP/IP Illustrated series is no exception. In this edition,
I hope to thoroughly update the pioneering work of Dr. Stevens with coverage of
new material while maintaining the exceptionally high standard of presentation
and detail common to his numerous books.

The first edition covers a broad set of protocols and their operation, ranging
from the link layer all the way to applications and network management. Today,
covering this breadth of material comprehensively in a single volume would
produce a very lengthy text indeed. For this reason, the second edition focuses
specifically on the core protocols: those relatively low-level protocols used most
frequently in providing the basic services of configuration, naming, data delivery,
and security for the Internet. Detailed discussions of applications, routing, Web
services, and other important topics are postponed to subsequent volumes.

Considerable progress has been made in improving the robustness and
compliance of TCP/IP implementations to their corresponding specifications since the
publication of the first edition. While many of the examples in the first edition
highlight implementation bugs or noncompliant behaviors, these problems have
largely been addressed in currently available systems, at least for IPv4. This fact
is not terribly surprising, given the greatly expanded use of the TCP/IP protocols
in the last 18 years. Misbehaving implementations are a comparative rarity, which
attests to a certain maturity of the protocol suite as a whole. The problems
encountered in the operation of the core protocols nowadays often relate to intentional
exploitation of infrequently used protocol features, a form of security concern that
was not a primary focus in the first edition but one that we spend considerable
effort to address in the second edition.

The Internet Milieu of the Twenty-first Century 

The usage patterns and importance of the Internet have changed considerably 
since the publication of the first edition. The most obvious watershed event was 
the creation and subsequent intense commercialization of the World Wide Web 
starting in the early 1990s. This event greatly accelerated the availability of the 
Internet to large numbers of people with various (sometimes conflicting)
motivations. As such, the protocols and systems originally implemented in a small-scale
environment of academic cooperation have been stressed by limited availability of 
addresses and an increase of security concerns. 

In response to the security threats, network and security administrators have
introduced special control elements into the network. It is now common practice to
place a firewall at the point of attachment to the Internet, for large enterprises
as well as small businesses and homes. As the demand for IP addresses and
security has increased over the last decade, Network Address Translation (NAT) is now
supported in virtually all current-generation routers and is in widespread use. It
has eased the pressure on Internet address availability by allowing sites to obtain
a comparatively small number of routable Internet addresses from their service
providers (one for each simultaneously online user), yet assign a very large
number of addresses to local computers without further coordination. A consequence
of NAT deployment has been a slowing of the migration to IPv6 (which provides
for an almost incomprehensibly large number of addresses) and interoperability
problems with some older protocols.
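
As a rough illustration of the address-sharing idea just described, the following
Python sketch models a basic NAT that lends one routable address from a small pool
to each internal host that is currently online. The class name BasicNat, its
methods, and the example addresses (taken from the 192.0.2.0/24 documentation range
and the 10.0.0.0/8 private range) are assumptions made for this illustration only.
Real NATs, described in Chapter 7, are considerably more involved.

```python
class BasicNat:
    """Toy model of basic NAT: one routable address per online internal host."""

    def __init__(self, public_pool):
        # Routable addresses obtained from the service provider (hypothetical values).
        self.free = list(public_pool)
        # Active bindings: private address -> routable address.
        self.bindings = {}

    def translate_outbound(self, private_addr):
        """Return the routable address used for traffic sent by private_addr."""
        if private_addr not in self.bindings:
            if not self.free:
                raise RuntimeError("no routable address available")
            self.bindings[private_addr] = self.free.pop(0)
        return self.bindings[private_addr]

    def release(self, private_addr):
        """The host went offline: return its routable address to the pool."""
        public_addr = self.bindings.pop(private_addr, None)
        if public_addr is not None:
            self.free.append(public_addr)


# Two routable addresses serve three internal hosts, as long as
# no more than two of them are online at the same time.
nat = BasicNat(["192.0.2.1", "192.0.2.2"])
first = nat.translate_outbound("10.0.0.5")
second = nat.translate_outbound("10.0.0.9")
nat.release("10.0.0.5")
third = nat.translate_outbound("10.0.0.77")
print(first, second, third)
```

The private addresses never need to be coordinated with the provider, because only
the pool addresses appear on the Internet side of the translator.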

As the users of personal computers began to demand Internet connectivity
by the mid-1990s, the largest supplier of PC software, Microsoft, abandoned its
original policy of offering only proprietary alternatives to the Internet and instead
undertook an effort to embrace TCP/IP compatibility in most of its products.
Since then, personal computers running its Windows operating system have
come to dominate the mix of PCs presently connected to the Internet. Over time,
a significant rise in the number of Linux-based systems means that such systems
now threaten to displace Microsoft as the frontrunner. Other operating systems,
including Oracle Solaris and Berkeley's BSD-based systems, which once
represented the majority of Internet-connected systems, are now a comparatively small
component of the mix. Apple's OS X (Mach-based) operating system has risen as
a new contender and is gaining in popularity, especially among portable
computer users. In 2003, portable computer (laptop) sales exceeded desktop sales as
the majority of personal computer types sold, and their proliferation has sparked
a demand for widely deployed, high-speed Internet access supported by
wireless infrastructure. It is projected that the most common method for accessing the
Internet from 2012 and beyond will be smartphones. Tablet computers also
represent an important growing contender.

Wireless networks are now available at a large number of locations such as
restaurants, airports, coffeehouses, and other public places. They typically
provide short-range free or pay-for-use (flat-rate) high-speed wireless Internet
connections using hardware compatible with commonly used office or home local
area network installations. A set of alternative "wireless broadband"
technologies based on cellular telephone standards (e.g., LTE, HSPA, UMTS, EV-DO) is
becoming widely available in developed regions of the world (and some
developing regions of the world that are "leapfrogging" to newer wireless technology),
offering longer-range operation, often at somewhat reduced bandwidths and with
volume-based pricing. Both types of infrastructure address the desire of users to
be mobile while accessing the Internet, using either portable computers or smaller
devices. In either case, mobile end users accessing the Internet over wireless
networks pose two significant technical challenges to the TCP/IP protocol
architecture. First, mobility affects the Internet's routing and addressing structure by
breaking the assumption that hosts have addresses assigned to them based upon
the identity of their nearby router. Second, wireless links may experience outages
and therefore cause data to be lost for reasons other than those typical of wired
links (which generally do not lose data unless too much traffic is being injected
into the network).

Finally, the Internet has fostered the rise of so-called peer-to-peer
applications forming "overlay" networks. Peer-to-peer applications do not rely on a
central server to accomplish a task but instead determine a set of peer computers with
which they can communicate and interact to accomplish a task. The peer computers
are operated by other end users and may come and go rapidly compared to a fixed
server infrastructure. The "overlay" concept captures the fact that such
interacting peers themselves form a network, overlaid atop the conventional TCP/IP-based
network (which, one may observe, is itself an overlay above the underlying
physical links). The development of peer-to-peer applications, while of intense interest
to those who study traffic flows and electronic commerce, has not had a profound
impact on the core protocols described in Volume 1 per se, but the concept of overlay
networks has become an important consideration for networking technology more
generally.

Content Changes for the Second Edition 

Regarding content in the text, the most important changes from the first edition 
are a restructuring of the scope of the overall text and the addition of significant 
material on security. Instead of attempting to cover nearly all common protocols 
in use at every layer in the Internet, the present text focuses in detail first on the 
non-security core protocols in widespread use, or that are expected to be in
widespread use in the near future: Ethernet (802.3), Wi-Fi (802.11), PPP, ARP, IPv4, IPv6,
UDP, TCP, DHCP, and DNS. These protocols are likely to be encountered by
system administrators and users alike.

In the second edition, security is covered in two ways. First, in each appropriate 
chapter, a section devoted to describing known attacks and their countermeasures 
relating to the protocol described in the chapter is included. These descriptions 
are not presented as a recipe for constructing attacks but rather as a practical
indication of the kinds of problems that may arise when protocol implementations (or
specifications, in some cases) are insufficiently robust. In today's Internet,
incomplete specification or lax implementation practice can lead to mission-critical
systems being compromised by even relatively unsophisticated attacks.

The second important discussion of security occurs in Chapter 18, where 
security and cryptography are studied in some detail, including protocols such as 
IPsec, TLS, DNSSEC, and DKIM. These protocols are now understood to be
important for implementing any service or application expected to maintain integrity
or secure operation. As the Internet has increased in commercial importance, the
need for security (and the number of threats to it) has grown proportionally.

Although IPv6 was not included in the first edition, there is now reason to 
believe that the use of IPv6 may increase significantly with the exhaustion of 
unallocated IPv4 address groups in February 2011. IPv6 was conceived largely 
to address the problems of IPv4 address depletion and, while not nearly as
common as IPv4 today, is becoming more important as a growing number of
small devices (such as cellular telephones, household devices, and environmental
sensors) become attached to the Internet. Events such as the World IPv6 Day (June
8, 2011) helped to demonstrate that the Internet can continue to work even as the
underlying protocols are modified and augmented in a significant way.

A second consideration for the structure of the second edition is a deemphasis
of the protocols that are no longer commonly used and an update of the
descriptions of those that have been revised substantially since the publication of the
first edition. The chapters covering RARP, BOOTP, NFS, SMTP, and SNMP have
been removed from the book, and the discussion of the SLIP protocol has been
abandoned in favor of expanded coverage of DHCP and PPP (including PPPoE).
The function of IP forwarding (described in Chapter 9 in the first edition) has
been integrated with the overall description of the IPv4 and IPv6 protocols in
Chapter 5 of this edition. The discussion of dynamic routing protocols (RIP, OSPF,
and BGP) has been removed, as the latter two protocols alone could each
conceivably merit a book-long discussion. Starting with ICMP, and continuing through IP,
TCP, and UDP, the impact of operation using IPv4 versus IPv6 is discussed in any
cases where the difference in operation is significant. There is no specific chapter
devoted solely to IPv6; instead, its impact relative to each existing core protocol is
described where appropriate. Chapters 15 and 25-30 of the first edition, which are
devoted to Internet applications and their supporting protocols, have been largely
removed; what remains only illustrates the operation of the underlying core
protocols where necessary.

Several chapters covering new material have been added. The first chapter
begins with a general introduction to networking issues and architecture, followed
by a more Internet-specific orientation. The Internet's addressing architecture is
covered in Chapter 2. A new chapter on host configuration and how a system "gets
on" the network appears as Chapter 6. Chapter 7 describes firewalls and Network
Address Translation (NAT), including how NATs are used in partitioning address
space between routable and nonroutable portions. The set of tools used in the first
edition has been expanded to include Wireshark (a free network traffic monitor
application with a graphical user interface).

The target readership for the second edition remains identical to that of the
first edition. No prior knowledge of networking concepts is required for
approaching it, although the advanced reader should benefit from the level of detail and
references. A rich collection of references is included in each chapter for the
interested reader to pursue.

Editorial Changes for the Second Edition 

The general flow of material in the second edition remains similar to that of the
first edition. After the introductory material (Chapters 1 and 2), the protocols are
presented in a bottom-up fashion to illustrate how the goal of network
communication presented in the introduction is realized in the Internet architecture. As in
the first edition, actual packet traces are used to illustrate the operational details
of the protocols, where appropriate. Since the publication of the first edition, freely
available packet capture and analysis tools with graphical interfaces have become
available, extending the capabilities of the tcpdump program used in the first
edition. In the present text, tcpdump is used when the points to be illustrated
are easily conveyed by examining the output of a text-based packet capture tool.
In most other cases, however, screen shots of the Wireshark tool are used. Please
be aware that some output listings, including snapshots of tcpdump output, are
wrapped or simplified for clarity.

The packet traces shown typically illustrate the behavior of one or more parts
of the network depicted on the inside of the front book cover. It represents a
broadband-connected "home" environment (typically used for client access or
peer-to-peer networking), a "public" environment (e.g., coffee shop), and an enterprise
environment. The operating systems used for examples include Linux, Windows,
FreeBSD, and Mac OS X. Various versions are used, as many different OS versions
are in use on the Internet today.

The structure of each chapter has been slightly modified from the first
edition. Each chapter begins with an introduction to the chapter topic, followed in
some cases by historical notes, the details of the chapter, a summary, and a set of
references. A section near the end of most chapters describes security concerns
and attacks. The per-chapter references represent a change for the second edition.
They should make each chapter more self-contained and require the reader to
perform fewer "long-distance page jumps" to find a reference. Some of the
references are now enhanced with WWW URLs for easier access online. In addition,
the reference format for papers and books has been changed to a somewhat more
compact form that includes the first initial of each author's last name followed by
the last two digits of the year (e.g., the former [Cerf and Kahn 1974] is now
shortened to [CK74]). For the numerous RFC references used, the RFC number is used
instead of the author names. This follows typical RFC conventions and has the
side benefit of grouping all the RFC references together in the reference lists.
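
The compact reference format can be sketched in a few lines of Python. This is only
an illustration of the convention as described above, not the procedure actually
used to prepare the reference lists; the function names citation_key and rfc_key are
hypothetical, and the four-digit zero padding of RFC numbers is inferred from keys
such as [RFC0871] that appear in the text.

```python
def citation_key(last_names, year):
    """Build a compact key such as [CK74] from author last names and a year."""
    # First initial of each author's last name, in author order.
    initials = "".join(name[0].upper() for name in last_names)
    # Last two digits of the publication year, zero padded (2008 -> "08").
    return f"[{initials}{year % 100:02d}]"


def rfc_key(number):
    """Build a key for an RFC from its number (four-digit padding assumed)."""
    return f"[RFC{number:04d}]"


print(citation_key(["Cerf", "Kahn"], 1974))
print(rfc_key(871))
```

Because every RFC key begins with the same prefix, sorting the keys alphabetically
places all RFC entries next to one another in a reference list.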

On a final note, the typographical conventions of the TCP/IP Illustrated series
have been maintained faithfully. However, the present author elected to use an
editor and typesetting package other than the Troff system used by Dr. Stevens
and some other authors of the Addison-Wesley Professional Computing Series
collection. Thus, the particular task of final copyediting could take advantage of the
significant expertise of Barbara Wood, the copy editor generously made available
to me by the publisher. We hope you will be pleased with the results.


Berkeley, California 
September 2011 


Kevin R. Fall


Introduction

This book describes the TCP/IP protocol suite, but from a different perspective 
than other texts on TCP/IP. Instead of just describing the protocols and what they
do, we'll use a popular diagnostic tool to watch the protocols in action. Seeing how
the protocols operate in varying circumstances provides a greater understanding
of how they work and why certain design decisions were made. It also provides
a look into the implementation of the protocols, without having to wade through
thousands of lines of source code.

When networking protocols were being developed in the 1960s through
the 1980s, expensive, dedicated hardware was required to see the packets going
"across the wire." Extreme familiarity with the protocols was also required to
comprehend the packets displayed by the hardware. Functionality of the
hardware analyzers was limited to that built in by the hardware designers.

Today this has changed dramatically with the ability of the ubiquitous
workstation to monitor a local area network [Mogul 1990]. Just attach a workstation to
your network, run some publicly available software, and watch what goes by on
the wire. While many people consider this a tool to be used for diagnosing network
problems, it is also a powerful tool for understanding how the network protocols
operate, which is the goal of this book.

This book is intended for anyone wishing to understand how the TCP/IP
protocols operate: programmers writing network applications, system administrators
responsible for maintaining computer systems and networks utilizing TCP/IP,
and users who deal with TCP/IP applications on a daily basis.


Typographical Conventions

When we display interactive input and output we'll show our typed input in a 
bold font, and the computer output like this. Comments are added in italics. 


 bsdi % telnet svr4 discard        connect to the discard server
 Trying 140.252.13.34...           this line and next output by Telnet client
 Connected to svr4.


Also, we always include the name of the system as part of the shell prompt (bsdi
in this example) to show on which host the command was run.


Note 

 Throughout the text we’ll use indented, parenthetical notes such as this to
describe historical points or implementation details. 


We sometimes refer to the complete description of a command in the Unix
manual as in ifconfig(8). This notation, the name of the command followed by a
number in parentheses, is the normal way of referring to Unix commands. The
number in parentheses is the section number in the Unix manual of the "manual
page" for the command, where additional information can be located.
Unfortunately, not all Unix systems organize their manuals the same way with regard to the
section numbers used for various groupings of commands. We'll use the
BSD-style section numbers (which are the same for BSD-derived systems such as SunOS
4.1.3), but your manuals may be organized differently.


Introduction


Effective communication depends on the use of a common language. This is true 
for humans and other animals as well as for computers. When a set of common 
behaviors is used with a common language, a protocol is being used. The first
definition of a protocol, according to the New Oxford American Dictionary, is

 The official procedure or system of rules governing affairs of state or diplomatic
 occasions.

We engage in many protocols every day: asking and responding to questions, 
negotiating business transactions, working collaboratively, and so on. Computers 
also engage in a variety of protocols. A collection of related protocols is called a 
protocol suite. The design that specifies how various protocols of a protocol suite 
relate to each other and divide up tasks to be accomplished is called the
architecture or reference model for the protocol suite. TCP/IP is a protocol suite that
implements the Internet architecture and draws its origins from the ARPANET Reference
Model (ARM) [RFC0871]. The ARM was itself influenced by early work on packet
switching in the United States by Paul Baran [B64] and Leonard Kleinrock [K64],
in the U.K. by Donald Davies [DBSW66], and in France by Louis Pouzin [P73].
Other protocol architectures have been specified over the years (e.g., the ISO
protocol architecture [Z80], Xerox's XNS [X85], and IBM's SNA [I96]), but TCP/IP has
become the most popular. There are several interesting books that focus on the
history of computer communications and the development of the Internet, such as
[P07] and [W02].

It is worth mentioning that the TCP/IP architecture evolved from work that 
addressed a need to provide interconnection of multiple different packet-switched 
computer networks [CK74]. This was accomplished using a set of gateways (later
called routers) that provided a translation function between each otherwise
incompatible network. The resulting "concatenated" network or catenet (later called
internetwork) would be much more useful, as many more nodes offering a wide variety
of services could communicate. The types of uses that a global network might
offer were envisioned years before the protocol architecture was fully developed.
In 1968, for example, J. C. R. Licklider and Bob Taylor foresaw the potential uses
for a global interconnected communication network to support
"supercommunities" [LT68]:

Today the on-line communities are separated from one another functionally as 
well as geographically. Each member can look only to the processing, storage and 
software capability of the facility upon which his community is centered. But 
 now the move is on to interconnect the separate communities and thereby
 transform them into, let us call it, a supercommunity. The hope is that interconnection
 will make available to all members of all the communities the programs and data
 resources of the entire supercommunity ... The whole will constitute a labile
 network of networks—ever-changing in both content and configuration.

Thus, it is apparent that the global network concept underpinning the
ARPANET and later the Internet was designed to support many of the types of uses we
enjoy today. However, getting to this point was neither simple nor obvious. The
success resulted from paying careful attention to design and engineering,
innovative users and developers, and the availability of sufficient resources to move from
concept to prototype and, eventually, to commercial networking products. 

This chapter provides an overview of the Internet architecture and TCP/IP 
protocol suite, to provide some historical context and to establish an adequate 
background for the remaining chapters. Architectures (both protocol and
physical) really amount to a set of design decisions about what features should be
supported and where such features should be logically implemented. Designing an
architecture is more art than science, yet we shall discuss some characteristics of 
architectures that have been deemed desirable over time. The subject of network 
architecture has been undertaken more broadly in the text by Day [D08], one of 
few such treatments. 


1.1 Architectural Principles 

The TCP/IP protocol suite allows computers, smartphones, and embedded devices 
of all sizes, supplied from many different computer vendors and running totally 
different software, to communicate with each other. By the turn of the twenty-first 
century it has become a necessity for modern communication, entertainment, and 
commerce. It is truly an open system in that the definition of the protocol suite and 
many of its implementations are publicly available at little or no charge. It forms 
the basis for what is called the global Internet, or the Internet, a wide area network 
(WAN) of about two billion users that literally spans the globe (as of 2010, about 
30% of the world's population). Although many people consider the Internet and 
the World Wide Web (WWW) to be interchangeable terms, we ordinarily refer to 
the Internet in terms of its ability to provide basic communication of messages 
between computers. We refer to WWW as an application that uses the Internet for
communication. It is perhaps the most important Internet application that brought
Internet technology to world attention in the early 1990s. 

Several goals guided the creation of the Internet architecture. In [C88], Clark 
recounts that the primary goal was to "develop an effective technique for
multiplexed utilization of existing interconnected networks." The essence of this
statement is that the Internet architecture should be able to interconnect multiple
distinct networks and that multiple activities should be able to run
simultaneously on the resulting interconnected network. Beyond this primary goal, Clark
provides a list of the following second-level goals:

 • Internet communication must continue despite loss of networks or gateways.

 • The Internet must support multiple types of communication services.

 • The Internet architecture must accommodate a variety of networks.

 • The Internet architecture must permit distributed management of its
 resources.

 • The Internet architecture must be cost-effective.

 • The Internet architecture must permit host attachment with a low level of
 effort.

 • The resources used in the Internet architecture must be accountable.

Many of the goals listed could have been supported with somewhat different
design decisions from those ultimately selected. However, a few design options
were gaining momentum when these architectural principles were being
formulated that influenced the designers in the particular choices they made. We will
mention some of the more important ones and their consequences.

1.1.1 Packets, Connections, and Datagrams 

Up to the 1960s, the concept of a network was based largely on the telephone
network. It was developed to connect telephones to each other for the duration of a
call. A call was normally implemented by establishing a connection from one party
to another. Establishing a connection meant that a circuit (initially, a physical
electrical circuit) was made between one telephone and another for the duration of a
call. When the call was complete, the connection was cleared, allowing the circuit
to be used by other users' calls. The call duration and identification of the
connection endpoints were used to perform billing of the users. When established, the
connection provided each user a certain amount of bandwidth or capacity to send
information (usually voice sounds). The telephone network progressed from its
analog roots to digital, which greatly improved its reliability and performance.
Data inserted into one end of a circuit follows some preestablished path through
the network switches and emerges on the other side in a predictable fashion,
usually with some upper bound on the time (latency). This gives predictable
service, as long as a circuit is available when a user needs one. Circuits allocate a
pathway through the network that is reserved for the duration of a call, even if
they are not entirely busy. This is a common experience today with the phone
network: as long as a call is taking place, even if we are not saying anything, we
are being charged for the time.

One of the important concepts developed in the 1960s (e.g., in [B64]) was the
idea of packet switching. In packet switching, "chunks" (packets) of digital
information comprising some number of bytes are carried through the network somewhat
independently. Chunks coming from different sources or senders can be mixed
together and pulled apart later, which is called multiplexing. The chunks can be
moved around from one switch to another on their way to a destination, and
the path might be subject to change. This has two potential advantages: the
network can be more resilient (the designers were worried about the network being
physically attacked), and there can be better utilization of the network links and
switches because of statistical multiplexing.

When packets are received at a packet switch, they are ordinarily stored in
buffer memory or a queue and processed in a first-come-first-served (FCFS) fashion. This
is the simplest method for scheduling the way packets are processed and is also
called first-in-first-out (FIFO). FIFO buffer management and on-demand
scheduling are easily combined to implement statistical multiplexing, which is the
primary method used to intermix traffic from different sources on the Internet. In
statistical multiplexing, traffic is mixed together based on the arrival statistics or
timing pattern of the traffic. Such multiplexing is simple and efficient, because if
there is any network capacity to be used and traffic to use it, the network will be
busy (high utilization) at every bottleneck or choke point. The downside of this
approach is limited predictability: the performance seen by any particular
application depends on the statistics of other applications that are sharing the network.
Statistical multiplexing is like a highway where the cars can change lanes and
ultimately intersperse in such a way that any point of constriction is as busy as it
can be.
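
As a rough illustration of FIFO statistical multiplexing, the following Python sketch merges packets from several senders into a single queue purely by arrival time. The tuple layout (arrival time, source, payload) and the function name are assumptions made for this example, not part of any real switch implementation.

```python
from collections import deque


def fifo_multiplex(arrivals):
    """Return packets in the order a FIFO switch would transmit them.

    arrivals: list of (arrival_time, source, payload) tuples
    (a hypothetical format chosen for this sketch).
    """
    queue = deque()
    # Packets join the buffer in arrival order, regardless of their source.
    for packet in sorted(arrivals, key=lambda p: p[0]):
        queue.append(packet)

    sent = []
    while queue:
        sent.append(queue.popleft())  # first in, first out
    return sent


arrivals = [
    (0.0, "A", "a1"),
    (0.4, "B", "b1"),
    (0.1, "A", "a2"),
    (0.3, "C", "c1"),
]
order = [payload for _, _, payload in fifo_multiplex(arrivals)]
print(order)  # ['a1', 'a2', 'c1', 'b1']
```

The output order depends only on when packets happened to arrive, which is why the service any one sender receives depends on what the other senders are doing.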

Alternative techniques, such as time-division multiplexing (TDM) and static
multiplexing, typically reserve a certain amount of time or other resources for data on
each connection. Although such techniques can lead to more predictability, a
feature useful for supporting constant bit rate telephone calls, they may not fully
utilize the network capacity because reserved bandwidth may go unused. Note that
while circuits are straightforwardly implemented using TDM techniques, virtual
circuits (VCs) that exhibit many of the behaviors of circuits but do not depend on
physical circuit switches can be implemented atop connection-oriented packets.
This is the basis for a protocol known as X.25 that was popular until about the
early 1990s when it was largely replaced with Frame Relay and ultimately digital
subscriber line (DSL) technology and cable modems supporting Internet
connectivity (see Chapter 3).
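
The utilization trade-off between reserved and shared capacity can be made concrete with a little arithmetic. The sketch below is a simplified model under assumed conditions: time is divided into frames, each source has a fixed demand per frame measured in slots, and the function names and numbers are invented for illustration.

```python
def tdm_utilization(demand, slots_per_source):
    """Fraction of link capacity used when each source may use only its own slots."""
    capacity = slots_per_source * len(demand)
    carried = sum(min(d, slots_per_source) for d in demand)
    return carried / capacity


def statistical_utilization(demand, slots_per_source):
    """Fraction of link capacity used when any source may use any free slot."""
    capacity = slots_per_source * len(demand)
    carried = min(sum(demand), capacity)
    return carried / capacity


# Four sources offer 8, 0, 1, and 3 slots of traffic; each is reserved 3 slots.
demand = [8, 0, 1, 3]
tdm = tdm_utilization(demand, 3)           # 7 of 12 slots carry data
stat = statistical_utilization(demand, 3)  # all 12 slots carry data
print(round(tdm, 3), stat)
```

In this model the busy source is held to its reservation under TDM while the idle source's slots go unused; with shared capacity the same link is fully used, at the cost of the busy source's traffic affecting everyone else.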



The VC abstraction and connection-oriented packet networks such as X.25
required some information or state to be stored in each switch for each
connection. The reason is that each packet carries only a small bit of overhead
information that provides an index into a state table. For example, in X.25 the 12-bit logical
channel identifier (LCI) or logical channel number (LCN) serves this purpose. At each
switch, the LCI or LCN is used in conjunction with the per-flow state in each switch
to determine the next switch along the path for the packet. The per-flow state is
established prior to the exchange of data on a VC using a signaling protocol that
supports connection establishment, clearing, and status information. Such
networks are consequently called connection-oriented.
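
To make the idea of per-flow state concrete, here is a minimal sketch of a connection-oriented switch in Python. The class, its method names, and the choice to key the table by input port and to rewrite the identifier at each hop are assumptions for illustration; only the 12-bit identifier width comes from the text.

```python
class VCSwitch:
    """Toy connection-oriented switch holding per-flow state (illustrative only)."""

    LCI_BITS = 12  # the text gives 12 bits for the X.25 LCI/LCN

    def __init__(self):
        # (input port, incoming LCI) -> (output port, outgoing LCI)
        self.state = {}

    def setup(self, in_port, lci, out_port, out_lci):
        """Install per-flow state; stands in for the signaling protocol."""
        for value in (lci, out_lci):
            if not 0 <= value < 2 ** self.LCI_BITS:
                raise ValueError("LCI must fit in 12 bits")
        self.state[(in_port, lci)] = (out_port, out_lci)

    def clear(self, in_port, lci):
        """Remove the state when the virtual circuit is cleared."""
        self.state.pop((in_port, lci), None)

    def forward(self, in_port, lci, payload):
        """Find the next hop using only the small index carried in the packet."""
        out_port, out_lci = self.state[(in_port, lci)]
        return out_port, out_lci, payload


switch = VCSwitch()
switch.setup(in_port=1, lci=5, out_port=2, out_lci=9)
print(switch.forward(1, 5, b"data"))  # (2, 9, b'data')
```

A packet whose identifier has no entry cannot be forwarded at all (the lookup raises `KeyError` here), which is why setup must complete before any data flows.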

Connection-oriented networks, whether built on circuits or packets, were the
most prevalent form of networking for many years. In the late 1960s, another option
was developed known as the datagram. Attributed in origin to the CYCLADES
[P73] system, a datagram is a special type of packet in which all the
identifying information of the source and final destination resides inside the packet itself
(instead of in the packet switches). Although this tends to require larger packets,
per-connection state at packet switches is no longer required and a connectionless
network could be built, eliminating the need for a (complicated) signaling
protocol. Datagrams were eagerly embraced by the designers of the early Internet, and
this decision had profound implications for the rest of the protocol suite.
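
By contrast with a virtual circuit, a datagram can be forwarded without any connection setup. The following sketch assumes a hypothetical `Datagram` record and a simple table mapping destinations to output ports; real forwarding tables are more elaborate, but the point is that nothing in the switch refers to an individual connection.

```python
from dataclasses import dataclass


@dataclass
class Datagram:
    src: str        # full source identity travels in the packet
    dst: str        # full destination identity travels in the packet
    payload: bytes


def forward(datagram, routes):
    """Choose the next hop from the destination carried in the datagram itself."""
    return routes[datagram.dst]


# Per-destination table (hypothetical names); no per-connection entries exist.
routes = {"hostB": "port2", "hostC": "port3"}

first = Datagram("hostA", "hostB", b"hello")
second = Datagram("hostD", "hostB", b"unrelated sender")
print(forward(first, routes), forward(second, routes))  # port2 port2
```

Both datagrams are handled by the same table entry even though they belong to unrelated conversations, and no signaling step was needed before either was sent.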

One other related concept is that of message boundaries or record markers. As
shown in Figure 1-1, when an application sends more than one chunk of
information into the network, the fact that more than one chunk was written may or
may not be preserved by the communication protocol. Most datagram protocols
preserve message boundaries. This is natural because the datagram itself has a
beginning and an end. However, in a circuit or VC network, it is possible that an
application may write several chunks of data, all of which are read together as one
or more different-size chunks by a receiving application. These types of protocols
do not preserve message boundaries. In cases where an underlying protocol fails
to preserve message boundaries but they are needed by an application, the
application must provide its own.

Figure 1-1 Applications write messages that are carried in protocols. A message boundary is the position or
byte offset between one write and another. Protocols that preserve message boundaries indicate
the position of the sender's message boundaries at the receiver. Protocols that do not preserve
message boundaries (e.g., streaming protocols like TCP) ignore this information and do not make
it available to a receiver. As a result, applications may need to implement their own methods to
indicate a sender's message boundaries if this capability is required.
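
One common way for an application to supply its own message boundaries over a byte stream is to prefix each message with its length. The sketch below shows that technique in Python; the 4-byte big-endian length field and the function names are choices made for this example rather than anything mandated by a particular protocol.

```python
import struct


def frame(message: bytes) -> bytes:
    """Prefix a message with its length as a 4-byte big-endian integer."""
    return struct.pack("!I", len(message)) + message


def deframe(stream: bytes):
    """Split a byte stream into complete messages plus any leftover bytes."""
    messages = []
    offset = 0
    while offset + 4 <= len(stream):
        (length,) = struct.unpack_from("!I", stream, offset)
        if offset + 4 + length > len(stream):
            break  # message not fully received yet; keep it for later
        messages.append(stream[offset + 4 : offset + 4 + length])
        offset += 4 + length
    return messages, stream[offset:]


# Two writes become one undifferentiated byte stream on the wire.
stream = frame(b"first") + frame(b"second message")
msgs, rest = deframe(stream)
print(msgs, rest)  # [b'first', b'second message'] b''
```

The receiver recovers the sender's two writes no matter how the stream was chunked in transit, and any partial message is returned as leftover bytes to be combined with later data.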

1.1.2 The End-to-End Argument and Fate Sharing 

When large systems such as an operating system or protocol suite are being 
designed, a question often arises as to where a particular feature or function 
should be placed. One of the most important principles that influenced the design 
of the TCP/IP suite is called the end-to-end argument [SRC84]: 

The function in question can completely and correctly be implemented only with 
the knowledge and help of the application standing at the end points of the
communication system. Therefore, providing that questioned function as a feature of
the communication itself is not possible. (Sometimes an incomplete version of the 
function provided by the communication system may be useful as a performance 
enhancement.) 

This argument may seem fairly straightforward upon first reading but can 
have profound implications for communication system design. It argues that
correctness and completeness can be achieved only by involving the application or
ultimate user of the communication system. Efforts to correctly implement what 
the application is "likely" to need are doomed to incompleteness. In short, this 
principle argues that important functions (e.g., error control, encryption, delivery 
acknowledgment) should usually not be implemented at low levels (or layers; see 
Section 1.2.1) of large systems. However, low levels may provide capabilities that 
make the job of the endpoints somewhat easier and consequently may improve 
performance. A nuanced reading reveals that this argument suggests that low- 
level functions should not aim for perfection because a perfect guess at what the 
application may require is unlikely to be possible. 

The end-to-end argument tends to support a design with a "dumb" network 
and "smart" systems connected to the network. This is what we see in the TCP/IP 
design, where many functions (e.g., methods to ensure that data is not lost,
controlling the rate at which a sender sends) are implemented in the end hosts where
the applications reside. The selection of which functions are implemented together 
in the same computer or network or software stack is the subject of another related 
principle known as fate sharing [C88]. 

Fate sharing suggests placing all the necessary state to maintain an active 
communication association (e.g., virtual connection) at the same location with
the communicating endpoints. With this reasoning, the only type of failure that
destroys communication is one that also destroys one or more of the endpoints,
which obviously destroys the overall communication anyhow. Fate sharing is one
of the design philosophies that allows virtual connections (e.g., those implemented
by TCP) to remain active even if connectivity within the network has failed for a
(modest) period of time. Fate sharing also supports a "dumb network with smart
end hosts" model, and one of the ongoing tensions in today's Internet is what
functions reside in the network and what functions do not.

1.1.3 Error Control and Flow Control 

There are some circumstances where data within a network gets damaged or lost. 
This can be for a variety of reasons such as hardware problems, radiation that 
modifies bits while being transmitted, being out of range in a wireless network, 
and other factors. Dealing with such errors is called error control, and it can be 
implemented in the systems constituting the network infrastructure, or in the
systems that attach to the network, or some combination. Naturally, the end-to-end
argument and fate sharing would suggest that error control be implemented close 
to or within applications. 

Usually, if a small number of bit errors are of concern, a number of
mathematical codes can be used to detect and repair the bit errors when data is received or
while it is in transit [LC04]. This task is routinely performed within the network.
When more severe damage occurs in a packet network, entire packets are
usually resent or retransmitted. In circuit-switched or VC-switched networks such as
X.25, retransmission tends to be done inside the network. This may work well for
applications that require strict in-order, error-free delivery of their data, but some
applications do not require this capability and do not wish to pay the costs (such
as connection establishment and potential retransmission delays) to have their
data reliably delivered. Even a reliable file transfer application does not really care
in what order the chunks of file data are delivered, provided it is eventually
satisfied that all chunks are delivered without errors and can be reassembled back into
the original order.

As an alternative to the overhead of reliable, in-order delivery implemented 
within the network, a different type of service called best-effort delivery was 
adopted by Frame Relay and the Internet Protocol. With best-effort delivery, the 
network does not expend much effort to ensure that data is delivered without 
errors or gaps. Certain types of errors are usually detected using error-detecting 
codes or checksums, such as those that might affect where a datagram is directed, 
but when such errors are detected, the errant datagram is merely discarded
without further action.
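
The detect-and-discard behavior can be sketched with a 16-bit one's-complement checksum of the kind used by the Internet protocols. The packet layout here (a 2-byte checksum followed by the data it covers) and the `send`/`receive` names are simplifications invented for this example; real IP headers are structured differently.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over data (padded to an even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def send(payload: bytes) -> bytes:
    """Build a toy packet: 2-byte checksum followed by the payload."""
    return internet_checksum(payload).to_bytes(2, "big") + payload


def receive(packet: bytes):
    """Return the payload, or None if the packet is damaged (silently dropped)."""
    stored = int.from_bytes(packet[:2], "big")
    payload = packet[2:]
    if internet_checksum(payload) != stored:
        return None  # best effort: no retransmission request, no error report
    return payload


good = send(b"hello")
damaged = bytearray(good)
damaged[3] ^= 0x01  # flip one bit in transit
print(receive(good), receive(bytes(damaged)))  # b'hello' None
```

Nothing in `receive` attempts recovery; noticing that data is missing and asking for it again is left to the end hosts.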

If best-effort delivery is successful, a fast sender can produce information at
a rate that exceeds the receiver's ability to consume it. In best-effort IP networks,
slowing down a sender is achieved by flow control mechanisms that operate
outside the network and at higher levels of the communication system. In particular,
TCP handles this type of problem, and we shall discuss it in detail in Chapters 15
and 16. This is consistent with the end-to-end argument: TCP, which resides at the
end hosts, handles rate control. It is also consistent with fate sharing: the approach
allows some elements of the network infrastructure to fail without necessarily
affecting the ability of the devices outside the network to communicate (as long as
some communication path continues to operate).


1.2 Design and Implementation 

Although a protocol architecture may suggest a certain approach to
implementation, it usually does not include a mandate. Consequently, we make a
distinction between the protocol architecture and the implementation architecture, which
defines how the concepts in a protocol architecture may be rendered into
existence, usually in the form of software.

Many of the individuals responsible for implementing the protocols for the
ARPANET were familiar with the software structuring of operating systems, and
an influential paper describing the "THE" multiprogramming system [D68]
advocated the use of a hierarchical structure as a way to deal with verification of the
logical soundness and correctness of a large software implementation. Ultimately,
this contributed to a design philosophy for networking protocols involving
multiple layers of implementation (and design). This approach is now called layering
and is the usual approach to implementing protocol suites.

1.2.1 Layering 

With layering, each layer is responsible for a different facet of the
communications. Layers are beneficial because a layered design allows developers to evolve
different portions of the system separately, often by different people with
somewhat different areas of expertise. The most frequently mentioned concept of
protocol layering is based on a standard called the Open Systems Interconnection (OSI)
model [Z80] as defined by the International Organization for Standardization
(ISO). Figure 1-2 shows the standard OSI layers, including their names, numbers,
and a few examples. The Internet's layering model is somewhat simpler, as we
shall see in Section 1.3.

Although the OSI model suggests that seven logical layers may be desirable
for modularity of a protocol architecture implementation, the TCP/IP
architecture is normally considered to consist of five. There was much debate about the
relative benefits and deficiencies of the OSI model, and the ARPANET model that
preceded it, during the early 1970s. Although it may be fair to say that TCP/IP
ultimately "won," a number of ideas and even entire protocols from the ISO
protocol suite (protocols standardized by ISO that follow the OSI model) have been
adopted for use with TCP/IP (e.g., IS-IS [RFC3787]).

Number  Name             Description/Example

7       Application      Specifies methods for accomplishing some user-initiated task.
                         Application-layer protocols tend to be devised and implemented by
                         application developers. Examples include FTP, Skype, etc.

6       Presentation     Specifies methods for expressing data formats and translation rules
                         for applications. A standard example would be conversion of EBCDIC
                         to ASCII coding for characters (but of little concern today).
                         Encryption is sometimes associated with this layer but can also be
                         found at other layers.

5       Session          Specifies methods for multiple connections constituting a
                         communication session. These may include closing connections,
                         restarting connections, and checkpointing progress. ISO X.225 is a
                         session-layer protocol.

4       Transport        Specifies methods for connections or associations between multiple
                         programs running on the same computer system. This layer may also
                         implement reliable delivery if not implemented elsewhere (e.g.,
                         Internet TCP, ISO TP4).

3       Network or       Specifies methods for communicating in a multihop fashion across
        Internetwork     potentially different types of link networks. For packet networks,
                         describes an abstract packet format and its standard addressing
                         structure (e.g., IP datagram, X.25 PLP, ISO CLNP).

2       Link             Specifies methods for communication across a single link, including
                         "media access" control protocols when multiple systems share the
                         same media. Error detection is commonly included at this layer,
                         along with link-layer address formats (e.g., Ethernet, Wi-Fi,
                         ISO 13239/HDLC).

1       Physical         Specifies connectors, data rates, and how bits are encoded on some
                         media. Also describes low-level error detection and correction,
                         plus frequency assignments. We mostly stay clear of this layer in
                         this text. Examples include V.92, Ethernet 1000BASE-T, SONET/SDH.

Figure 1-2 The standard seven-layer OSI model as specified by the ISO. Not all protocols are implemented by
every networked device (at least in theory). The OSI terminology and layer numbers are widely
used.


As described briefly in Figure 1-2, each layer has a different responsibility. 
From the bottom up, the physical layer defines methods for moving digital
information across a communication medium such as a phone line or fiber-optic cable.
Portions of the Ethernet and Wireless LAN (Wi-Fi) standards are here, although 
we do not delve into this layer very much in this text. The link or data-link layer 
includes those protocols and methods for establishing connectivity to a neighbor 
sharing the same medium. Some link-layer networks (e.g., DSL) connect only two 
neighbors. When more than one neighbor can access the same shared network, the 
network is said to be a multi-access network. Wi-Fi and Ethernet are examples of 
such multi-access link-layer networks, and specific protocols are used to mediate 
which stations have access to the shared medium at any given time. We discuss 
these in Chapter 3. 

Moving up the layer stack, the network or internetwork layer is of great interest
to us. For packet networks such as TCP/IP, it provides an interoperable packet
format that can use different types of link-layer networks for connectivity. The layer
also includes an addressing scheme for hosts and routing algorithms that choose
where packets go when sent from one machine to another. Above layer 3 we find
protocols that are (at least in theory) implemented only by end hosts, including
the transport layer. Also of great interest to us, it provides a flow of data between
sessions and can be quite complex, depending on the types of services it provides
(e.g., reliable delivery on a packet network that might drop data). Sessions
represent ongoing interactions between applications (e.g., when "cookies" are used
with a Web browser during a Web login session), and session-layer protocols may
provide capabilities such as connection initiation and restart, plus checkpointing
(saving work that has been accomplished so far). Above the session layer we find
the presentation layer, which is responsible for format conversions and standard
encodings for information. As we shall see, the Internet protocols do not include a
formal session or presentation protocol layer, so these functions are implemented
by applications if needed.

The top layer is the application layer. Applications usually implement their
own application-layer protocols, and these are the ones most visible to users.
There is a wide variety of application-layer protocols, and programmers are
constantly inventing new ones. Consequently, the application layer is where there is
the greatest amount of innovation and where new capabilities are developed and
deployed.

1.2.2 Multiplexing, Demultiplexing, and Encapsulation in Layered 
Implementations 

One of the major benefits of a layered architecture is its natural ability to perform
protocol multiplexing. This form of multiplexing allows multiple different protocols
to coexist on the same infrastructure. It also allows multiple instantiations of the
same protocol object (e.g., connections) to be used simultaneously without being
confused.

Multiplexing can occur at different layers, and at each layer a different sort of
identifier is used for determining which protocol or stream of information belongs
together. For example, at the link layer, most link technologies (such as Ethernet
and Wi-Fi) include a protocol identifier field value in each packet to indicate which
protocol is being carried in the link-layer frame (IP is one such protocol). When
an object (packet, message, etc.), called a protocol data unit (PDU), at one layer is
carried by a lower layer, it is said to be encapsulated (as opaque data) by the next
layer down. Thus, multiple objects at layer N can be multiplexed together using
encapsulation in layer N - 1. Figure 1-3 shows how this works. The identifier at
layer N - 1 is used to determine the correct receiving protocol or program at layer
N during demultiplexing.

Figure 1-3 Encapsulation is usually used in conjunction with layering. Pure encapsulation involves
taking the PDU of one layer and treating it as opaque (uninterpreted) data at the layer
below. Encapsulation takes place at each sender, and decapsulation (the reverse
operation) takes place at each receiver. Most protocols use headers during encapsulation; a few
also use trailers. [Diagram labels: Layer Number; Encapsulated Object; Front of PDU]

In Figure 1-3, each layer has its own concept of a message object (a PDU)
corresponding to the particular layer responsible for creating it. For example, if a layer
4 (transport) protocol produces a packet, it would properly be called a layer 4 PDU
or transport PDU (TPDU). When a layer is provided a PDU from the layer above it,
it usually "promises" to not look into the contents of the PDU. This is the essence
of encapsulation: each layer treats the data from above as opaque,
uninterpretable information. Most commonly a layer prepends the PDU with its own header,
although trailers are used by some protocols (not TCP/IP). The header is used for
multiplexing data when sending, and for the receiver to perform demultiplexing
based on a demultiplexing (demux) identifier. In TCP/IP networks such identifiers
are commonly hardware addresses, IP addresses, and port numbers. The header
may also include important state information, such as whether a virtual circuit is
being set up or has already completed setup. The resulting object is another PDU.
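
Encapsulation and demultiplexing can be sketched with a toy header consisting of a single identifier byte. The one-byte format, the identifier values, and the names below are all invented for this illustration; real headers carry far more information.

```python
def encapsulate(demux_id: int, upper_pdu: bytes) -> bytes:
    """Prepend a 1-byte header naming the protocol of the PDU being carried."""
    if not 0 <= demux_id <= 255:
        raise ValueError("demux identifier must fit in one byte")
    return bytes([demux_id]) + upper_pdu  # upper_pdu is treated as opaque data


def decapsulate(pdu: bytes):
    """Strip the header and return (demux identifier, carried PDU)."""
    return pdu[0], pdu[1:]


TRANSPORT_X = 7  # hypothetical identifier for a transport protocol
NETWORK_Y = 42   # hypothetical identifier for a network protocol

app_data = b"hello"
# The network-layer header says which transport protocol gets the contents.
network_pdu = encapsulate(TRANSPORT_X, app_data)
# The link-layer header says which network protocol gets the contents.
link_frame = encapsulate(NETWORK_Y, network_pdu)

# The receiver reverses the process, one layer at a time.
network_id, inner = decapsulate(link_frame)
transport_id, data = decapsulate(inner)
print(network_id, transport_id, data)  # 42 7 b'hello'
```

Each call to `encapsulate` never inspects the bytes it is given, which mirrors the "promise" not to look inside the PDU from the layer above.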

One other important feature of layering suggested by Figure 1-2 is that in pure 
layering not all networked devices need to implement all the layers. Figure 1-4 
shows that in some cases a device needs to implement only a few layers if it is 
expected to perform only certain types of processing. 

In Figure 1-4, a somewhat idealized small internet includes two end systems, a 
switch, and a router. In this figure, each number corresponds to a type of protocol 
at a particular layer. As we can see, each device implements a different subset of 
the layer stack. The host on the left implements three different link-layer protocols 
(D, E, and F) with corresponding physical layers and three different transport- 
layer protocols (A, B, and C) that run on a single type of network-layer protocol. 
End hosts implement all the layers, switches implement up to layer 2 (this switch 
implements D and G), and routers implement up to layer 3. Routers are capable 
of interconnecting different types of link-layer networks and must implement the 
link-layer protocols for each of the network types they interconnect.

Figure 1-4 Different network devices implement different subsets of the protocol stack. End hosts tend to
implement all the layers. Routers implement layers below the transport layer, and switches
implement link-layer protocols and below. This idealized structure is often violated because routers and
switches usually include the ability to act as a host (e.g., to be managed and set up) and therefore
need an implementation of all of the layers even if they are rarely used.


The internet of Figure 1-4 is somewhat idealized because today's switches and
routers often implement more than the protocols they are absolutely required to
implement for forwarding data. This is for a number of reasons, including
management. In such circumstances, devices such as routers and switches must
sometimes act as hosts and support services such as remote login. To do this, they
usually must implement transport and application protocols.

Although we show only two hosts communicating, the link- and physical-layer
networks (labeled as D and G) might have multiple hosts attached. If so,
then communication is possible between any pair of systems that implement the
appropriate higher-layer protocols. In Figure 1-4 we can differentiate between an
end system (the two hosts on either side) and an intermediate system (the router in
the middle) for a particular protocol suite. Layers above the network layer use
end-to-end protocols. In our picture these layers are needed only on the end systems.
The network layer, however, provides a hop-by-hop protocol and is used on the two
end systems and every intermediate system. The switch or bridge is not ordinarily
considered an intermediate system because it is not addressed using the
internetworking protocol's addressing format, and it operates in a fashion that is largely
transparent to the network-layer protocol. From the point of view of the routers
and end systems, the switch or bridge is essentially invisible.

A router, by definition, has two or more network interfaces (because it
connects two or more networks). Any system with multiple interfaces is called
multihomed. A host can also be multihomed, but unless it specifically forwards packets
from one interface to another, it is not called a router. Also, routers need not be
special hardware boxes that only move packets around an internet. Most TCP/IP
implementations, for example, allow a multihomed host to act as a router also,
if properly configured to do so. In this case we can call the system either a host
(when an application such as File Transfer Protocol (FTP) [RFC0959] or the Web is
used) or a router (when it is forwarding packets from one network to another). We
will use whichever term makes sense given the context.

One of the goals of an internet is to hide all of the details of the physical
layout (the topology) and lower-layer protocol heterogeneity from the applications.
Although this is not obvious from our two-network internet in Figure 1-4, the
application layers should not care (and do not care) that even though each host
is attached to a network using link-layer protocol D (e.g., Ethernet), the hosts are
separated by a router and switch that use link-layer G. There could be 20
routers between the hosts, with additional types of physical interconnections, and the
applications would run without modification (although the performance might be
somewhat different). Abstracting the details in this way is what makes the
concept of an internet so powerful and useful.


1.3 The Architecture and Protocols of the TCP/IP Suite 

So far we have discussed architecture, protocols, protocol suites, and
implementation techniques in the abstract. In this section, we discuss the architecture and
particular protocols that constitute the TCP/IP suite. Although this has become the
established term for the protocols used on the Internet, there are many protocols
beyond TCP and IP in the collection or family of protocols used with the
Internet. We begin by noting how the ARPANET reference model of layering, which
ultimately formed the basis for the Internet's protocol layering, differs somewhat
from the OSI layering discussed earlier.

1.3.1 The ARPANET Reference Model 

Figure 1-5 depicts the layering inspired by the ARPANET reference model, which
was ultimately adopted by the TCP/IP suite. The structure is simpler than the OSI
model, but real implementations include a few specialized protocols that do not fit
cleanly into the conventional layers.

Starting from the bottom of Figure 1-5 and working our way up the stack, 
the first layer we see is 2.5, an "unofficial" layer. There are several protocols that 
operate here, but one of the oldest and most important is called the Address
Resolution Protocol (ARP). It is a specialized protocol used with IPv4 and only with
multi-access link-layer protocols (such as Ethernet and Wi-Fi) to convert between 
the addresses used by the IP layer and the addresses used by the link layer. We 
examine this protocol in Chapter 4. In IPv6 the address-mapping function is part 
of ICMPv6, which we discuss in Chapter 8.

Number  Name               Description/Example

7       Application        Virtually any Internet-compatible application, including the Web
                           (HTTP), DNS (Chapter 11), DHCP (Chapter 6).

4       Transport          Provides exchange of data between abstract "ports" managed by
                           applications. May include error and flow control. Examples: TCP
                           (Chapters 13-17), UDP (Chapter 10), SCTP, DCCP.

3.5     Network (Adjunct)  Unofficial "layer" that helps accomplish setup, management, and
                           security for the network layer. Examples: ICMP (Chapter 8) and
                           IGMP (Chapter 9), IPsec (Chapter 18).

3       Network            Defines abstract datagrams and provides routing. Examples
                           include IP (32-bit addresses, 64KB maximum size) and IPv6
                           (128-bit addresses, up to 4GB maximum size). Chapters 2, 5.

2.5     Link (Adjunct)     Unofficial "layer" used to map addresses used at the network to
                           those used at the link layer on multi-access link-layer networks.
                           Example: ARP (Chapter 4).

Figure 1-5 Protocol layering based on the ARM or TCP/IP suite used in the Internet. There are no official
session or presentation layers. In addition, there are several "adjunct" or helper protocols that do not
fit well into the standard layers yet perform critical functions for the operation of the other
protocols. Some of these protocols are not used with IPv6 (e.g., IGMP and ARP).


At layer number 3 in Figure 1-5 we find IP, the main network-layer protocol 
for the TCP/IP suite. We discuss it in detail in Chapter 5. The PDU that IP sends to 
link-layer protocols is called an IP datagram and may be as large as 64KB (and up 
to 4GB for IPv6). In many cases we shall use the simpler term packet to mean an 
IP datagram when the usage context is clear. Fitting large packets into link-layer 
PDUs (called frames) that may be smaller is handled by a function called 
fragmentation that may be performed by IP hosts and some routers when necessary. In 
fragmentation, portions of a larger datagram are sent in multiple smaller datagrams 
called fragments and put back together (called reassembly) when reaching the 
destination. We discuss fragmentation in Chapter 10. 
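
To make fragmentation and reassembly concrete, the following Python sketch splits 
a payload into pieces that fit a given link-layer size and then puts them back 
together. It is a simplification under stated assumptions: a fixed 20-byte IPv4 
header with no options, offsets counted in 8-byte units as in IPv4, and helper 
names (fragment, reassemble) that are ours and not part of any real stack. 

```python
def fragment(payload: bytes, mtu: int, header_len: int = 20):
    """Split payload into (offset_in_8_byte_units, more_fragments, data) tuples."""
    # Every fragment except the last must carry a multiple of 8 data bytes,
    # because the IPv4 Fragment Offset field counts in 8-byte units.
    max_data = ((mtu - header_len) // 8) * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    for start in range(0, len(payload), max_data):
        chunk = payload[start:start + max_data]
        more = start + max_data < len(payload)   # False only for the last piece
        fragments.append((start // 8, more, chunk))
    return fragments


def reassemble(fragments):
    """Rebuild the original payload; fragments may arrive in any order."""
    ordered = sorted(fragments, key=lambda f: f[0])
    return b"".join(chunk for _, _, chunk in ordered)


payload = bytes(range(256)) * 16       # 4096 bytes of example data
frags = fragment(payload, mtu=1500)    # 1500 bytes is a common Ethernet MTU
for offset, more, chunk in frags:
    print(f"offset={offset * 8:5d}  more_fragments={more}  length={len(chunk)}")

# Reassembly does not depend on arrival order.
assert reassemble(reversed(frags)) == payload
```

A real implementation must also cope with lost, duplicated, and overlapping 
fragments, which this sketch ignores. 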

Throughout the text we shall use the term IP to refer to both IP versions 4 and 
6. We use the term IPv6 to refer to IP version 6, and IPv4 to refer to IP version 4, 
currently the most popular version. When discussing architecture, the details of 
IPv4 versus IPv6 matter little. When we delve into the way particular addressing 
and configuration functions work (Chapter 2 and Chapter 6), for example, these 
details will become more important. 

Because IP packets are datagrams, each one contains the address of the layer 
3 sender and recipient. These addresses are called IP addresses and are 32 bits long 
for IPv4 and 128 bits long for IPv6; we discuss them in detail in Chapter 2. This 
difference in IP address size is the characteristic that most differentiates IPv4 from 
IPv6. The destination address of each datagram is used to determine where each 
datagram should be sent, and the process of making this determination and 
sending the datagram to its next hop is called forwarding. Both routers and hosts 
perform forwarding, although routers tend to do it much more often. There are three 
types of IP addresses, and the type affects how forwarding is performed: unicast 
(destined for a single host), broadcast (destined for all hosts on a given network), 
and multicast (destined for a set of hosts that belong to a multicast group). Chapter 
2 looks at the types of addresses used with IP in more detail. 
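
The address sizes and types just described can be inspected with Python's standard 
ipaddress module. This is only an illustration: the addresses are documentation and 
well-known example values chosen by us, and describe is a hypothetical helper name. 

```python
import ipaddress


def describe(text):
    """Return (IP version, address size in bits, coarse type) for an address string."""
    addr = ipaddress.ip_address(text)
    kind = "multicast" if addr.is_multicast else "not multicast"
    return addr.version, addr.max_prefixlen, kind


for text in ("192.0.2.10", "224.0.0.1", "2001:db8::1", "ff02::1"):
    version, bits, kind = describe(text)
    print(f"{text:12s} IPv{version}  {bits:3d} bits  {kind}")

# The directed broadcast address of an IPv4 network has all host bits set to 1.
net = ipaddress.ip_network("192.0.2.0/24")
print("broadcast address of", net, "is", net.broadcast_address)
```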

The Internet Control Message Protocol (ICMP) is an adjunct to IP, and we label 
it as a layer 3.5 protocol. It is used by the IP layer to exchange error messages and 
other vital information with the IP layer in another host or router. There are two 
versions of ICMP: ICMPv4, used with IPv4, and ICMPv6, used with IPv6. ICMPv6 
is considerably more complex and includes functions such as address 
autoconfiguration and Neighbor Discovery that are handled by other protocols (e.g., ARP) 
on IPv4 networks. Although ICMP is used primarily by IP, it is also possible for 
applications to use it. Indeed, we will see that two popular diagnostic tools, ping 
and traceroute, use ICMP. ICMP messages are encapsulated within IP 
datagrams in the same way transport layer PDUs are. 

The Internet Group Management Protocol (IGMP) is another protocol adjunct to 
IPv4. It is used with multicast addressing and delivery to manage which hosts are 
members of a multicast group (a group of receivers interested in receiving traffic for 
a particular multicast destination address). We describe the general properties of 
broadcasting and multicasting, along with IGMP and the Multicast Listener 
Discovery protocol (MLD, used with IPv6), in Chapter 9. 

At layer 4, the two most common Internet transport protocols are vastly 
different. The most widely used, the Transmission Control Protocol (TCP), deals with 
problems such as packet loss, duplication, and reordering that are not repaired 
by the IP layer. It operates in a connection-oriented (VC) fashion and does not 
preserve message boundaries. Conversely, the User Datagram Protocol (UDP) 
provides little more than the features provided by IP. UDP allows applications to send 
datagrams that preserve message boundaries but imposes no rate control or error 
control. 
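
The difference in message boundaries can be observed with a short Python 
experiment over the loopback interface. It is a sketch, not a benchmark: the 
function names are ours, and it assumes that loopback delivers both small UDP 
datagrams without loss, which is typical but not guaranteed by UDP. 

```python
import socket

MESSAGES = [b"hello", b"there"]


def udp_demo():
    """Send two datagrams; each recv() returns exactly one whole message."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as rx, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as tx:
        rx.bind(("127.0.0.1", 0))        # port 0: let the OS choose a port
        rx.settimeout(2.0)               # do not wait forever if one is lost
        for msg in MESSAGES:
            tx.sendto(msg, rx.getsockname())
        return [rx.recv(1024) for _ in MESSAGES]


def tcp_demo():
    """Send the same two messages; the receiver sees a single byte stream."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        with socket.create_connection(listener.getsockname()) as tx:
            conn, _ = listener.accept()
            with conn:
                conn.settimeout(2.0)
                for msg in MESSAGES:
                    tx.sendall(msg)
                tx.shutdown(socket.SHUT_WR)   # signal the end of the stream
                received = b""
                while True:
                    data = conn.recv(1024)
                    if not data:              # empty read: sender is done
                        break
                    received += data
                return received


print("UDP:", udp_demo())   # a list with one entry per datagram
print("TCP:", tcp_demo())   # the bytes of both messages, with no boundary
```

An application that needs message boundaries over TCP has to add its own framing, 
for example a length prefix in front of each message. 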

TCP provides a reliable flow of data between two hosts. It is concerned with 
things such as dividing the data passed to it from the application into 
appropriately sized chunks for the network layer below, acknowledging received packets, 
and setting timeouts to make certain the other end acknowledges packets that 
are sent. Because this reliable flow of data is provided by the transport layer, 
the application layer can ignore all these details. The PDU that TCP sends to IP is 
called a TCP segment. 

UDP, on the other hand, provides a much simpler service to the application 
layer. It allows datagrams to be sent from one host to another, but there is no 
guarantee that the datagrams reach the other end. Any desired reliability must 
be added by the application layer. Indeed, about all that UDP provides is a set 
of port numbers for multiplexing and demultiplexing data, plus a data integrity 
checksum. As we can see, UDP and TCP differ radically even though they are at 
the same layer. There is a use for each type of transport protocol, which we will 
see when we look at the different applications that use TCP and UDP. 

There are two additional transport-layer protocols that are relatively new 
and available on some systems. As they are not yet very widespread, we do not 
devote much discussion to them, but they are worth being aware of. The first is the 
Datagram Congestion Control Protocol (DCCP), specified in [RFC4340]. It provides a 
type of service midway between TCP and UDP: connection-oriented exchange of 
unreliable datagrams but with congestion control. Congestion control comprises 
a number of techniques whereby a sender is limited to a sending rate in order to 
avoid overwhelming the network. We discuss it in detail with respect to TCP in 
Chapter 16. 

The other transport protocol available on some systems is called the Stream 
Control Transmission Protocol (SCTP), specified in [RFC4960]. SCTP provides 
reliable delivery like TCP but does not require the sequencing of data to be strictly 
maintained. It also allows for multiple streams to logically be carried on the same 
connection and provides a message abstraction, which differs from TCP. SCTP 
was designed for carrying signaling messages on IP networks that resemble those 
used in the telephone network. 

Above the transport layer, the application layer handles the details of the 
particular application. There are many common applications that almost every 
implementation of TCP/IP provides. The application layer is concerned with the details 
of the application and not with the movement of data across the network. The 
lower three layers are the opposite: they know nothing about the application but 
handle all the communication details. 

1.3.2 Multiplexing, Demultiplexing, and Encapsulation in TCP/IP 

We have already discussed the basics of protocol multiplexing, demultiplexing, 
and encapsulation. At each layer there is an identifier that allows a receiving 
system to determine which protocol or data stream belongs together. Usually there is 
also addressing information at each layer. This information is used to ensure that 
a PDU has been delivered to the right place. Figure 1-6 shows how demultiplexing 
works in a hypothetical Internet host. 

Although it is not really part of the TCP/IP suite, we shall begin bottom-up 
and mention how demultiplexing from the link layer is performed, using Ethernet 
as an example. We discuss several link-layer protocols in Chapter 3. An arriving 
Ethernet frame contains a 48-bit destination address (also called a link-layer or 
MAC address, for Media Access Control) and a 16-bit field called the Ethernet type. 
A value of 0x0800 (hexadecimal) indicates that the frame contains an IPv4 
datagram. Values of 0x0806 and 0x86DD indicate ARP and IPv6, respectively. 
Assuming that the destination address matches one of the receiving system's addresses, 
the frame is received and checked for errors, and the Ethernet Type field value is 
used to select which network-layer protocol should process it. 
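
The link-layer step just described can be sketched in a few lines of Python. The 
sketch assumes a plain Ethernet header of 14 bytes (destination address, source 
address, type) with no VLAN tag, ignores broadcast and multicast destinations and 
the trailing error check, and uses a function name, demux_ethernet, that is purely 
illustrative. 

```python
import struct

# Ethernet type values named in the text.
ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x86DD: "IPv6"}


def demux_ethernet(frame: bytes, my_mac: bytes):
    """Return (protocol_name, payload), or None if the frame is not addressed to us."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # "!" = network byte order; two 6-byte addresses, then a 16-bit type field.
    dst, _src, ethertype = struct.unpack("!6s6sH", frame[:14])
    if dst != my_mac:
        return None
    return ETHERTYPES.get(ethertype, "unknown"), frame[14:]


my_mac = bytes.fromhex("020000000001")       # made-up, locally administered address
sender = bytes.fromhex("020000000002")
frame = my_mac + sender + struct.pack("!H", 0x0806) + b"payload"
print(demux_ethernet(frame, my_mac))
```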

Figure 1-6 The TCP/IP stack uses a combination of addressing information and protocol 
demultiplexing identifiers to determine if a datagram has been received correctly and, if so, 
what entity should process it. Several layers also check numeric values (e.g., checksums) 
to ensure that the contents have not been damaged in transit. 

Assuming that the received frame contains an IP datagram, the Ethernet 
header and trailer information is removed, and the remaining bytes (which 
constitute the frame's payload) are given to IP for processing. IP checks a number of 
items, including the destination IP address in the datagram. If the destination 
address matches one of its own and the datagram contains no errors in its header 
(IP does not check its payload), the 8-bit IPv4 Protocol field (called Next Header 
in IPv6) is checked to determine which protocol to invoke next. Common values 
include 1 (ICMP), 2 (IGMP), 4 (IPv4), 6 (TCP), and 17 (UDP). The value of 4 (and 
41, which indicates IPv6) is interesting because it indicates the possibility that an 
IP datagram may appear inside the payload area of an IP datagram. This violates 
the original concepts of layering and encapsulation but is the basis for a powerful 
technique known as tunneling, which we discuss more in Chapter 3. 
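
The IPv4 checks described above can be sketched as follows. The code assumes the 
standard IPv4 header layout (header length in the low 4 bits of the first byte, 
Protocol field at byte 9, header checksum at bytes 10-11, destination address at 
bytes 16-19) and the usual Internet checksum; the function names and the sample 
addresses are ours. A host with several addresses would compare against all of 
them, which this sketch does not do. 

```python
import struct

PROTOCOLS = {1: "ICMP", 2: "IGMP", 4: "IPv4", 6: "TCP", 17: "UDP", 41: "IPv6"}


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, folded to 16 bits and complemented."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                        # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def demux_ipv4(datagram: bytes, my_addr: bytes):
    """Return (protocol_name, payload), or None if the datagram is dropped."""
    if len(datagram) < 20:
        return None
    header_len = (datagram[0] & 0x0F) * 4     # header length is given in 32-bit words
    header = datagram[:header_len]
    if internet_checksum(header) != 0:        # a valid header checks out to zero
        return None
    if header[16:20] != my_addr:              # not addressed to this host
        return None
    return PROTOCOLS.get(header[9], "unknown"), datagram[header_len:]


# Build a sample 20-byte header carrying UDP (protocol 17) and 4 bytes of data.
src, dst = bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2])
header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 24, 0, 0, 64, 17, 0, src, dst)
header = header[:10] + struct.pack("!H", internet_checksum(header)) + header[12:]
print(demux_ipv4(header + b"data", dst))
```

Only the header is covered by the checksum, which matches the remark that IP does 
not check its payload. 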

Once the network layer (IPv4 or IPv6) determines that the incoming datagram 
is valid and the correct transport protocol has been determined, the resulting 
datagram (reassembled from fragments if necessary) is passed to the transport layer 
for processing. At the transport layer, most protocols (including TCP and UDP) 
use port numbers for demultiplexing to the appropriate receiving application. 

1.3.3 Port Numbers 

Port numbers are 16-bit nonnegative integers (i.e., range 0-65535). These numbers 
are abstract and do not refer to anything physical. Instead, each IP address has 
65,536 associated port numbers for each transport protocol that uses port numbers 
(most do), and they are used for determining the correct receiving application. For 
client/server applications (see Section 1.5.1), a server first "binds" to a port 
number, and subsequently one or more clients establish connections to the port 
number using a particular transport protocol on a particular machine. In this sense, 
port numbers act more like telephone number extensions, except they are usually 
assigned by standards. 

Standard port numbers are assigned by the Internet Assigned Numbers 
Authority (IANA). The set of numbers is divided into special ranges, including the 
well-known port numbers (0-1023), the registered port numbers (1024-49151), and 
the dynamic/private port numbers (49152-65535). Traditionally, servers wishing to 
bind to (i.e., offer service on) a well-known port require special privileges such as 
administrator or "root" access. 

The range of well-known ports is used for identifying many well-known 
services such as the Secure Shell Protocol (SSH, port 22), FTP (ports 20 and 21), Telnet 
remote terminal protocol (port 23), e-mail/Simple Mail Transfer Protocol (SMTP, 
port 25), Domain Name System (DNS, port 53), the Hypertext Transfer Protocol or Web 
(HTTP and HTTPS, ports 80 and 443), Interactive Mail Access Protocol (IMAP and 
IMAPS, ports 143 and 993), Simple Network Management Protocol (SNMP, ports 161 
and 162), Lightweight Directory Access Protocol (LDAP, port 389), and several others. 
Protocols with multiple ports (e.g., HTTP and HTTPS) often have different port 
numbers depending on whether Transport Layer Security (TLS) is being used with 
the base application-layer protocol (see Chapter 18). 
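
As a small illustration of the three ranges, the following Python function 
classifies a port number. The function name port_range and the short table of 
services are ours; the boundaries and port values are those given in the text. 

```python
def port_range(port: int) -> str:
    """Classify a port number into the range names used by IANA."""
    if not 0 <= port <= 65535:
        raise ValueError("port numbers are 16-bit nonnegative integers")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic/private"


# A few of the services listed above, with their port numbers.
SERVICES = {"SSH": 22, "SMTP": 25, "DNS": 53, "HTTP": 80, "HTTPS": 443, "IMAPS": 993}

for name, port in SERVICES.items():
    print(f"{name:6s} {port:5d}  {port_range(port)}")
```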


Note 

If we examine the port numbers for these standard services and other standard 
TCP/IP services (Telnet, FTP, SMTP, etc.), we see that most are odd numbers. 
This is historical, as these port numbers are derived from the NCP port numbers. 
(NCP, the Network Control Protocol, preceded TCP as a transport-layer protocol 
for the ARPANET.) NCP was simplex, not full duplex, so each application required 
two connections, and an even-odd pair of port numbers was reserved for each 
application. When TCP and UDP became the standard transport layers, only a 
single port number was needed per application, yet the odd port numbers from 
NCP were used. 


The registered port numbers are available to clients or servers with special 
privileges, but IANA keeps a reserved registry for particular uses, so these port 
numbers should generally be avoided when developing new applications unless 
an IANA allocation has been procured. The dynamic/private port numbers are 
essentially unregulated. As we will see, in some circumstances (e.g., on clients) 
the value of the port number matters little because the port number being used 
is transient. Such port numbers are also called ephemeral port numbers. They are 
considered to be temporary because a client typically needs one only as long as the 
user running the client needs service, and the client does not need to be found by 
the server in order to establish a connection. Servers, conversely, generally require 
names and port numbers that do not change often in order to be found by clients. 
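
Ephemeral ports are easy to observe. In the Python sketch below, the client never 
chooses a port; the operating system assigns one when the connection is made, and 
the server learns it from the incoming connection. For convenience the server here 
also binds to port 0 so that the example can run anywhere, which a real service 
with a published port would not do. The function name is ours, and the range from 
which the operating system draws ephemeral ports is a local setting that need not 
coincide with the IANA dynamic/private range. 

```python
import socket


def ephemeral_demo():
    """Return (server_port, client_port, client_port_as_seen_by_server)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))    # port 0 asks the OS for any free port
        server.listen(1)
        server_port = server.getsockname()[1]
        with socket.create_connection(("127.0.0.1", server_port)) as client:
            client_port = client.getsockname()[1]   # assigned by the OS at connect time
            conn, peer = server.accept()
            conn.close()
            return server_port, client_port, peer[1]


server_port, client_port, seen_by_server = ephemeral_demo()
print("server port:", server_port)
print("client (ephemeral) port:", client_port)
print("client port seen by server:", seen_by_server)
```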

1.3.4 Names, Addresses, and the DNS 

With TCP/IP, each link-layer interface on each computer (including routers) has 
at least one IP address. IP addresses are enough to identify a host, but they are 
not very convenient for humans to remember or manipulate (especially the long 
addresses used with IPv6). In the TCP/IP world, the DNS is a distributed database 
that provides the mapping between host names and IP addresses (and vice versa). 
Names are set up in a hierarchy, ending in domains such as .com, .org, .gov, .in, 
.uk, and .edu. Perhaps surprisingly, DNS is an application-layer protocol and 
thus depends on the other protocols in order to operate. Although most of the 
TCP/IP suite does not use or care about names, typical users (e.g., those using Web 
browsers) use names frequently, so if the DNS fails to function properly, normal 
Internet access is effectively disabled. Chapter 11 looks into the DNS in detail. 

Applications that manipulate names can call a standard API function (see 
Section 1.5.3) to look up the IP address (or addresses) corresponding to a given 
host's name. Similarly, a function is provided to do the reverse lookup: given an 
IP address, look up the corresponding host name. Most applications that take a host 
name as input also take an IP address. Web browsers support this capability. For 
example, the Uniform Resource Locators (URLs) http://131.243.2.201/index.html 
and http://[2001:400:610:102::c9]/index.html can be typed into a Web 
browser and are both effectively equivalent to http://ee.lbl.gov/index.html (at 
the time of writing; the second example requires IPv6 connectivity to be successful). 
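
In Python, the two lookups are exposed through the socket module's getaddrinfo 
and getnameinfo functions, which wrap the corresponding sockets API calls. The 
sketch below uses wrapper names of our own (forward_lookup, reverse_lookup) and 
looks up only local names so that it does not depend on any particular Internet 
host; the answers returned depend on the resolver configuration of the machine 
running it. 

```python
import socket


def forward_lookup(host: str):
    """Return the sorted IP address strings for a host name or a literal address."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def reverse_lookup(address: str) -> str:
    """Return the name the resolver reports for an IP address (or the address itself)."""
    name, _service = socket.getnameinfo((address, 0), 0)
    return name


print(forward_lookup("localhost"))     # name to address(es)
print(forward_lookup("127.0.0.1"))     # a literal address is accepted as well
print(reverse_lookup("127.0.0.1"))     # address to name
```

That a literal address is accepted wherever a name is expected is what lets an 
application take either form as input. 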


1.4 Internets, Intranets, and Extranets 

As suggested previously, the Internet has developed as the aggregate network 
resulting from the interconnection of constituent networks over time. The 
lowercase internet means multiple networks connected together, using a common 
protocol suite. The uppercase Internet refers to the collection of hosts around the world 
that can communicate with each other using TCP/IP. The Internet is an internet, 
but the reverse is not true. 

One of the reasons for the phenomenal growth in networking during the 
1980s was the realization that isolated groups of stand-alone computers made 
little sense. A few stand-alone systems were connected together into a network. 
Although this was a step forward, during the 1990s we realized that separate 
networks that could not interoperate were not as valuable as a bigger network 
that could. This notion is the basis for the so-called Metcalfe's Law, which states 
roughly that the value of a computer network is proportional to the square of the 
number of connected endpoints (e.g., users or devices). The Internet idea, and its 
supporting protocols, would make possible the interconnection of different 
networks. This deceptively simple concept turns out to be remarkably powerful. 

The easiest way to build an internet is to connect two or more networks with 
a router. A router is often a special-purpose device for connecting networks. The 
nice thing about routers is that they provide connections to many different types 
of physical networks: Ethernet, Wi-Fi, point-to-point links, DSL, cable Internet 
service, and so on. 


Note 

These devices are also called IP routers, but we will use the term router. Historically 
these devices were called gateways, and this term is used throughout much of the 
older TCP/IP literature. Today the term gateway is used for an application-layer 
gateway (ALG), a process that connects two different protocol suites (say, TCP/IP 
and IBM’s SNA) for one particular application (often electronic mail or file transfer). 


In recent years, other terms have been adopted for different configurations of 
internets using the TCP/IP protocol suite. An intranet is the term used to describe a 
private internetwork, usually run by a business or other enterprise. Most often, the 
intranet provides access to resources available only to members of the particular 
enterprise. Users may connect to their (e.g., corporate) intranet using a virtual private 
network (VPN). VPNs help to ensure that access to potentially sensitive resources in 
an intranet is made available only to authorized users, usually using the tunneling 
concept we mentioned previously. We discuss VPNs in more detail in Chapter 7. 

In many cases an enterprise or business wishes to set up a network containing 
servers accessible to certain partners or other associates using the Internet. Such 
networks, which also often involve the use of a VPN, are known as extranets and 
consist of computers attached outside the serving enterprise's firewall (see 
Chapter 7). Technically, there is little difference between an intranet, an extranet, and 
the Internet, but the usage cases and administrative policies are usually different, 
and therefore a number of these more specific terms have evolved. 


1.5 Designing Applications 

The network concepts we have touched upon so far provide a fairly simple service 
model [RFC6250]: moving bytes between programs running on different (or, 
occasionally, the same) computers. To do anything useful with this capability, we need 
networked applications that use the network for providing services or 
performing computations. Networked applications are typically structured according to a 
small number of design patterns. The most common of these are client/server and 
peer-to-peer. 

1.5.1 Client/Server 

Most network applications are designed so that one side is the client and the other 
side is the server. The server provides some type of service to clients, such as 
access to files on the server host. We can categorize servers into two classes: 
iterative and concurrent. An iterative server iterates through the following steps: 

I1. Wait for a client request to arrive. 

I2. Process the client request. 

I3. Send the response back to the client that sent the request. 

I4. Go back to step I1. 

The problem with an iterative server occurs when step I2 takes a long time. 
During this time no other clients are serviced. A concurrent server, on the other 
hand, performs the following steps: 

C1. Wait for a client request to arrive. 

C2. Start a new server instance to handle this client's request. This may involve 
creating a new process, task, or thread, depending on what the underlying 
operating system supports. This new server handles one client's entire 
request. When the requested task is complete, the new server terminates. 
Meanwhile, the original server instance continues to C3. 

C3. Go back to step C1. 

The advantage of a concurrent server is that the server just spawns other 
server instances to handle the client requests. Each client has, in essence, its own 
server. Assuming that the operating system allows multiprogramming 
(essentially all do today), multiple clients are serviced concurrently. The reason we 
categorize servers, and not clients, is that a client normally cannot tell whether it is 
talking to an iterative server or a concurrent server. As a general rule, most servers 
are concurrent. 
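
The two server structures can be sketched in Python as a single accept loop that 
either handles each client in-line (the iterative steps) or hands it to a new 
thread (the concurrent steps). This is a minimal illustration under our own 
assumptions: the "service" merely upper-cases one short request, threads stand in 
for whatever process or task mechanism a system offers, and the loop stops after 
a fixed number of clients so that the example terminates, whereas a real server 
would loop indefinitely. All names are ours. 

```python
import socket
import threading


def handle(conn):
    """Process one request and send the response (steps I2-I3, or the work of C2)."""
    with conn:
        request = conn.recv(1024)        # assumes a short request arrives in one read
        conn.sendall(request.upper())


def serve(listener, concurrent, max_clients):
    """Accept max_clients connections, serving them iteratively or concurrently."""
    workers = []
    for _ in range(max_clients):
        conn, _addr = listener.accept()                  # I1 / C1: wait for a client
        if concurrent:
            worker = threading.Thread(target=handle, args=(conn,))
            worker.start()                               # C2: new instance per client
            workers.append(worker)                       # C3: loop back immediately
        else:
            handle(conn)                                 # I2-I3, then back to I1
    for worker in workers:
        worker.join()


def demo(concurrent):
    """Run a server on loopback and send it two requests from a client."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(5)
        address = listener.getsockname()
        server = threading.Thread(target=serve, args=(listener, concurrent, 2))
        server.start()
        replies = []
        for text in (b"first", b"second"):
            with socket.create_connection(address) as client:
                client.sendall(text)
                replies.append(client.recv(1024))
        server.join()
        return replies


print("iterative: ", demo(concurrent=False))
print("concurrent:", demo(concurrent=True))
```

The client code is identical in both runs, which illustrates why a client normally 
cannot tell which kind of server it is talking to. 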

Note that we use the terms client and server to refer to applications and not 
to the particular computer systems on which they run. The very same terms are 
sometimes used to refer to the pieces of hardware that are most often used to 
execute either client or server applications. Although the terminology is thus 
somewhat imprecise, it works well enough in practice. As a result, it is common to find 
a server (in the hardware sense) running more than one server (in the application 
sense). 

1.5.2 Peer-to-Peer 

Some applications are designed in a more distributed fashion where there is no 
single server. Instead, each application acts both as a client and as a server, 
sometimes as both at once, and is capable of forwarding requests. Some very popular 
applications (e.g., Skype [SKYPE], BitTorrent [BT]) are of this form. These 
applications are called peer-to-peer or p2p applications. A concurrent p2p application may 
receive an incoming request, determine if it is able to respond to the request, and 
if not forward the request on to some other peer. Thus, the set of p2p applications 
together form a network among applications, also called an overlay network. Such 
overlays are now commonplace and can be extremely powerful. Skype, for 
example, has grown to be the largest carrier of international telephone calls. According 
to some estimates, BitTorrent was responsible for more than half of all Internet 
traffic in 2009 [IPIS]. 

One of the primary problems in p2p networks is called the discovery problem. 
That is, how does one peer find which other peer(s) can provide the data or service 
it wants in a network where peers may come and go? This is usually handled by 
a bootstrapping procedure whereby each client is initially configured with the 
addresses and port numbers of some peers that are likely to be operating. Once 
connected, the new participant learns of other active peers and, depending on the 
protocol, what services or files they provide. 
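
The bootstrapping idea can be illustrated without any real networking. In the 
Python sketch below, a dictionary stands in for the overlay: looking a peer up in 
it plays the role of contacting that peer, and a missing entry plays the role of a 
peer that has left. The peer names, port numbers, and file names are invented for 
the example, and no particular p2p protocol is being modeled. 

```python
from collections import deque

# Hypothetical overlay: each active peer knows some neighbors and offers some files.
OVERLAY = {
    "peerA:7000": {"neighbors": ["peerB:7000", "peerC:7000"], "files": {"a.iso"}},
    "peerB:7000": {"neighbors": ["peerD:7000"], "files": {"b.iso"}},
    "peerC:7000": {"neighbors": ["peerA:7000"], "files": set()},
    "peerD:7000": {"neighbors": ["peerE:7000"], "files": {"a.iso", "d.iso"}},
    # peerE is still listed by peerD but has left the network, so it has no entry.
}


def discover(bootstrap_peers, overlay):
    """Breadth-first walk from the bootstrap list; returns {active peer: its files}."""
    known = {}
    seen = set(bootstrap_peers)
    queue = deque(bootstrap_peers)
    while queue:
        peer = queue.popleft()
        info = overlay.get(peer)         # stands in for a query sent to the peer
        if info is None:                 # no answer: the peer is not operating
            continue
        known[peer] = info["files"]
        for neighbor in info["neighbors"]:
            if neighbor not in seen:     # each peer is contacted at most once
                seen.add(neighbor)
                queue.append(neighbor)
    return known


# The new participant is configured with two bootstrap peers, one of them offline.
found = discover(["peerX:7000", "peerA:7000"], OVERLAY)
print("active peers:", sorted(found))
print("peers offering a.iso:", sorted(p for p, files in found.items() if "a.iso" in files))
```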

1.5.3 Application Programming Interfaces (APIs) 

Applications, whether p2p or client/server, need to express their desired network 
operations (e.g., make a connection, write or read data). This is usually supported 
by a host operating system using a networking application programming interface 
(API). The most popular API is called sockets or Berkeley sockets, indicating where it 
was originally developed [LJFK93]. 

This text is not a programming text, but occasionally we refer to a feature of 
TCP/IP and whether that feature is provided by the sockets API or not. All of the 
programming details with examples for sockets can be found in [SFR04]. 
Modifications to sockets intended for use with IPv6 are also described in a number 
of freely available online documents [RFC3493][RFC3542][RFC3678][RFC4584] 
[RFC5014]. 

1.6 Standardization Process 

Newcomers to the TCP/IP suite often wonder just who is responsible for 
specifying and standardizing the various protocols and how they operate. A number 
of organizations represent the answer to this question. The group with which 
we will most often be concerned is the Internet Engineering Task Force (IETF) 
[RFC4677]. This group meets three times each year in various locations around 
the world to develop, discuss, and agree on standards for the Internet's "core" 
protocols. Exactly what constitutes "core" is subject to some debate, but common 
protocols such as IPv4, IPv6, TCP, UDP, and DNS are clearly in the purview of 
IETF. Attendance at IETF meetings is open to anyone, but it is not free. 

IETF is a forum that elects leadership groups called the Internet 
Architecture Board (IAB) and the Internet Engineering Steering Group (IESG). The IAB is 
chartered to provide architectural guidance to activities in IETF and to perform a 
number of other tasks such as appointing liaisons to other standards-defining 
organizations (SDOs). The IESG has decision-making authority regarding the creation 
and approval of new standards, along with modifications to existing standards. 
The "heavy lifting" or detailed work is generally performed by IETF working 
groups that are coordinated by working group chairs who volunteer for this task. 

In addition to the IETF, there are two other important groups that interact 
closely with the IETF. The Internet Research Task Force (IRTF) explores protocols, 
architectures, and procedures that are not deemed mature enough for 
standardization. The chair of the IRTF is a nonvoting member of IAB. The IAB, in turn, 
works with the Internet Society (ISOC) to help influence and promote worldwide 
policies and education regarding Internet technologies and usage. 

1.6.1 Request for Comments (RFC) 

Every official standard in the Internet community is published as a Request for 
Comments, or RFC. RFCs can be created in a number of ways, and the publisher of 
RFCs (called the RFC editor) recognizes multiple document streams corresponding 
to the way an RFC has been developed. The current streams (as of 2010) include 
the IETF, IAB, IRTF, and independent submission streams. Prior to being accepted 
and published (permanently) as an RFC, documents exist as temporary Internet 
drafts while they receive comments and progress through the editing and review 
process. 

Not all RFCs are standards. Only so-called standards-track category RFCs 
are considered to be official standards. Other categories include best current 
practice (BCP), informational, experimental, and historic. It is important to realize that 
just because a document is an RFC does not mean that the IETF has endorsed it 
as any form of standard. Indeed, there exist RFCs on which there is significant 
disagreement. 

The RFCs range in size from a few pages to several hundred. Each is 
identified by a number, such as RFC 1122, with higher numbers for newer RFCs. They 
are all available for free from a number of Web sites, including 
http://www.rfc-editor.org. For historical reasons, RFCs are generally delivered as basic text files, 
although some RFCs have been reformatted or authored using more advanced file 
formats. 

A number of RFCs have special significance because they summarize, clarify, 
or interpret particular sets of other standards. For example, [RFC5000] defines 
the set of all other RFCs that are considered official standards as of mid-2008 (the 
most recent such RFC at the time of writing). An updated list is available at the 
current standards Web site [OIPSW]. The Host Requirements RFCs ([RFC1122] and 
[RFC1123]) define requirements for protocol implementations in Internet IPv4 
hosts, and the Router Requirements RFC [RFC1812] does the same for routers. The 
Node Requirements RFC [RFC4294] does both for IPv6 systems. 

1.6.2 Other Standards 

Although the IETF is responsible for standardizing most of the protocols we 
discuss in this text, other SDOs are responsible for defining protocols that merit our 
attention. The most important of these groups include the Institute of Electrical 
and Electronics Engineers (IEEE), the World Wide Web Consortium (W3C), and 
the International Telecommunication Union (ITU). In their activities relevant to 
this text, IEEE is concerned with standards below layer 3 (e.g., Wi-Fi and Ethernet), 
and W3C is concerned with application-layer protocols, specifically those related 
to Web technologies (e.g., HTML-based syntax). ITU, and more specifically ITU-T 
(formerly CCITT), standardizes protocols used within the telephone and cellular 
networks, which are becoming an ever more important component of the Internet. 


1.7 Implementations and Software Distributions 

The historical de facto standard TCP/IP implementations were from the Computer 
Systems Research Group (CSRG) at the University of California, Berkeley. They 
were distributed with the 4.x BSD (Berkeley Software Distribution) system and 
with the BSD Networking Releases until the mid-1990s. This source code has been 
the starting point for many other implementations. Today, each popular operating 
system has its own implementation. In this text, we tend to draw examples from 
the TCP/IP implementations in Linux, Windows, and sometimes FreeBSD and 
Mac OS (both of which are derived from historical BSD releases). In most cases, 
the particular implementation matters little. 

Figure 1-7 shows a chronology of the various BSD releases, indicating the 
important TCP/IP features we cover in later chapters. It also shows the years when 
Linux and Windows began supporting TCP/IP. The BSD Networking Releases 
shown in the second column were freely available public source code releases 
containing all of the networking code, both the protocols themselves and many of the 
applications and utilities (e.g., the Telnet remote terminal program and FTP file 
transfer program). 

By the mid-1990s, the Internet and TCP/IP were well established. All 
subsequent popular operating systems support TCP/IP natively. Research and 
development of new TCP/IP features, previously found first in BSD releases, are now 
typically found first in Linux releases. Windows has recently implemented a new 
TCP/IP stack (starting with Windows Vista) with many new features and native 
IPv6 capability. Linux, FreeBSD, and Mac OS X also support IPv6 without setting 
any special configuration options. 

[Figure 1-7 is a timeline chart with the column headings "Required Licenses" and 
"License Free". The releases it shows are listed here by family.] 

4.x BSD releases 
  4.1aBSD (1981)               Contained experimental version of BBN's TCP/IP 
  4.2BSD (1983)                First widely available release of TCP/IP 
  4.3BSD (1986)                TCP performance improvements 
  4.3BSD Tahoe (1988)          TCP fast retransmit and congestion control 
  4.3BSD Reno (1990)           TCP fast recovery, header prediction, header 
                               compression in SLIP (precursor to PPP), routing 
                               table changes 
  4.4BSD(-Encumbered) (1993)   Multicast support, "long fat pipe" mods 

BSD Networking Releases 
  BSD Networking Software Release 1.0 (1989)   Net/1 
  BSD Networking Software Release 2.0 (1991)   Net/2 
  4.4BSD-Lite (1994)                           Net/3 

Linux 
  Linux 0.98 (1992)            Initial version of TCP/IP 
  Linux 0.99 (1992-9)          TCP/IP bug fixes 
  Linux 1.0.0 (1994)           More TCP/IP bug fixes 

Windows 
  Winsock (1992)                        TCP/IP from third parties 
  Windows for Workgroups 3.11 (1994)    Initial version of TCP/IP supplied by 
                                        Microsoft (Wolverine) as add-on 
  Windows 95 (1995)                     Initial integrated version of TCP/IP 
                                        supplied by Microsoft 

Figure 1-7 The history of software releases supporting TCP/IP up to 1995. The various BSD releases pioneered 
the availability of TCP/IP. In part because of legal uncertainties regarding the BSD releases in the 
early 1990s, Linux was developed as an alternative that was initially tailored for PC users. 
Microsoft began supporting TCP/IP in Windows a couple of years later. 


1.8 Attacks Involving the Internet Architecture 

Throughout the text we shall briefly describe attacks and vulnerabilities that
have been discovered in the design or implementation of the topic we are
discussing. Few attacks target the Internet architecture as a whole. However, it is
worth observing that the Internet architecture delivers IP datagrams based on
destination IP addresses. As a result, malicious users are able to insert whatever
IP address they choose into the source IP address field of each IP datagram they
send, an activity called spoofing. The resulting datagrams are delivered to their
destinations, but it is difficult to perform attribution. That is, it may be difficult or
impossible to determine the origin of a datagram received from the Internet.

Spoofing can be combined with a variety of other attacks seen periodically on
the Internet. Denial-of-service (DoS) attacks usually involve using so much of some
important resource that legitimate users are denied service. For example, sending
so many IP datagrams to a server that it spends all of its time just processing the
incoming packets and performing no other useful work is a type of DoS attack.
Other DoS attacks may involve clogging the network with so much traffic that
no other packets can be sent. This is often accomplished by using many sending
computers, forming a distributed DoS (DDoS) attack.

Unauthorized access attacks involve accessing information or resources in an
unauthorized fashion. This can be accomplished with a variety of techniques such
as exploiting protocol implementation bugs to take control of a system (called
owning the system and turning it into a zombie or bot). It can also involve
various forms of masquerading such as an attacker's agent impersonating a legitimate
user (e.g., by running with the user's credentials). Some of the more pernicious
attacks involve taking control of many remote systems using malicious software
(malware) and using them in a coordinated, distributed fashion (called botnets).
Programmers who intentionally develop malware and exploit systems for (illegal)
profit or other malicious purposes are generally called black hats. So-called white
hats do the same sorts of technical things but notify vulnerable parties instead of
exploiting them.

One other concern with the Internet architecture is that the original Internet
protocols did not perform any encryption in support of authentication, integrity,
or confidentiality. Consequently, malicious users could usually ascertain private
information by merely observing packets in the network. Those with the ability
to modify packets in transit could also impersonate users or alter the contents of
messages. Although these problems have been reduced significantly thanks to
encryption protocols (see Chapter 18), old or poorly designed protocols are still
sometimes used that are vulnerable to simple eavesdropping attacks. Given the
prevalence of wireless networks, where it is relatively easy to "sniff" the packets
sent by others, such older or insecure protocols should be avoided. Note that while
encryption may be enabled at one layer (e.g., on a link-layer Wi-Fi network), only
host-to-host encryption (IP layer or above) protects information across the
multiple network segments an IP datagram is likely to traverse on its way to its final
destination.


1.9 Summary 

This chapter has been a whirlwind tour of concepts in network architecture and
design in general, plus the TCP/IP protocol suite in particular that we discuss in
detail in later chapters. The Internet architecture was designed to interconnect
different existing networks and provide for a wide range of services and protocols
operating simultaneously. Packet switching using datagrams was chosen for its
robustness and efficiency. Security and predictable delivery of data (e.g., bounded
latency) were secondary concerns.

Based on their understanding of layered and modular software design in
operating systems, the early implementers of the Internet protocols adopted a
layered design that employs encapsulation. The three main layers in the TCP/IP
protocol suite are the network layer, transport layer, and application layer, and we
mentioned the different responsibilities of each. We also mentioned the link layer
because it relates so closely with the TCP/IP suite. We shall discuss each in more
detail in subsequent chapters.

In TCP/IP, the distinction between the network layer and the transport layer is
critical: the network layer (IP) provides an unreliable datagram service and must
be implemented by all systems addressable on the Internet, whereas the transport
layers (TCP and UDP) provide an end-to-end service to applications running on
end hosts. The primary transport layers differ radically. TCP provides in-order,
reliable stream delivery with flow control and congestion control. UDP provides
essentially no capabilities beyond IP except port numbers for demultiplexing and
an error detection mechanism. Unlike TCP, however, it supports multicast delivery.

Addresses and demultiplexing identifiers are used by each layer to avoid
confusing protocols or different associations/connections of the same protocol.
Link-layer multi-access networks often use 48-bit addresses; IPv4 uses 32-bit addresses
and IPv6 uses 128-bit addresses. The TCP and UDP transport protocols use
distinct sets of port numbers. Some port numbers are assigned by standards, and
others are used temporarily, usually by client applications when communicating with
servers. Port numbers do not represent anything physical; they are merely used as
a way for applications that want to communicate to rendezvous.

Although port numbers and IP addresses are usually enough to identify the
location of a service on the Internet, they are not very convenient for humans to
remember or use (especially IPv6 addresses). Consequently, the Internet uses
a hierarchical system of host names that can be converted to IP addresses (and
back) using DNS, a distributed database application running on the Internet. DNS
has become an essential component of the Internet infrastructure, and efforts are
under way to make it more secure (see Chapter 18).

An internet is a collection of networks. The common building block for an
internet is a router that connects the networks at the IP layer. The "capital-I"
Internet is an internet that spans the globe and interconnects nearly two billion users
(as of 2010). Private internets are called intranets and are usually connected to the
Internet using special devices (firewalls, discussed in Chapter 10) that attempt to
prevent unauthorized access. Extranets usually consist of a subset of an
institution's intranet that is designed to be accessed by partners or affiliates in a limited
way.

Networked applications are usually designed using a client/server or
peer-to-peer design pattern. Client/server is more popular and traditional, but
peer-to-peer designs have also seen tremendous success. Whatever the design pattern,
applications invoke APIs to perform networking tasks. The most common API for
TCP/IP networks is called sockets. It was provided with BSD UNIX distributions,
software releases that pioneered the use of TCP/IP. By the late 1990s the TCP/IP
protocol suite and sockets API were available on every popular operating system.

Security was not a major design goal for the Internet architecture.
Determining where packets originate can be difficult for a receiver, as end hosts can easily
spoof source IP addresses in unsecured IP datagrams. Distributed DoS attacks
also remain an ongoing challenge because victim end hosts can be collected
together to form botnets that can carry out DDoS and other attacks, sometimes
without the system owners' knowledge. Finally, early Internet protocols did little
to ensure privacy of sensitive information, but most of those protocols are now
deprecated, and modern replacements use encryption to provide confidential and
authenticated communications between hosts.


2.1 Introduction

This chapter deals with the structure of network-layer addresses used in the
Internet, also known as IP addresses. We discuss how addresses are allocated and
assigned to devices on the Internet, the way hierarchy in address assignment aids
routing scalability, and the use of special-purpose addresses, including broadcast,
multicast, and anycast addresses. We also discuss how the structure and use of
IPv4 and IPv6 addresses differ.

Every device connected to the Internet has at least one IP address. Devices
used in private networks based on the TCP/IP protocols also require IP addresses.
In either case, the forwarding procedures implemented by IP routers (see Chapter
5) use IP addresses to identify where traffic is going. IP addresses also indicate
where traffic has come from. IP addresses are similar in some ways to telephone
numbers, but whereas telephone numbers are often known and used directly by
end users, IP addresses are often shielded from a user's view by the Internet's DNS
(see Chapter 11), which allows most users to use names instead of numbers. Users
are confronted with manipulating IP addresses when they are required to set up
networks themselves or when the DNS has failed for some reason. To understand
how the Internet identifies hosts and routers and delivers traffic between them,
we must understand the role of IP addresses. We are therefore interested in their
administration, structure, and uses.

When devices are attached to the global Internet, they are assigned addresses
that must be coordinated so as to not duplicate other addresses in use on the
network. For private networks, the IP addresses being used must be coordinated to
avoid similar overlaps within the private networks. Groups of IP addresses are
allocated to users and organizations. The recipients of the allocated addresses then
assign addresses to devices, usually according to some network "numbering plan."
For global Internet addresses, a hierarchical system of administrative entities helps
in allocating addresses to users and service providers. Individual users typically
receive address allocations from Internet service providers (ISPs) that provide both
the addresses and the promise of routing traffic in exchange for a fee.


2.2 Expressing IP Addresses 

The vast majority of Internet users who are familiar with IP addresses understand
the most popular type: IPv4 addresses. Such addresses are often represented in
so-called dotted-quad or dotted-decimal notation, for example, 165.195.130.107.
The dotted-quad notation consists of four decimal numbers separated by periods.
Each such number is a nonnegative integer in the range [0, 255] and represents
one-quarter of the entire IP address. The dotted-quad notation is simply a way of
writing the whole IPv4 address (a 32-bit nonnegative integer used throughout
the Internet system) using convenient decimal numbers. In many circumstances
we will be concerned with the binary structure of the address. A number of
Internet sites, such as http://www.subnetmask.info and
http://www.subnet-calculator.com, now contain calculators for converting between formats of
IP addresses and related information. Table 2-1 gives a few examples of IPv4
addresses and their corresponding binary representations, to get started.


Table 2-1 Example IPv4 addresses written in dotted-quad and binary notation


Dotted-Quad Representation    Binary Representation

0.0.0.0                       00000000 00000000 00000000 00000000
1.2.3.4                       00000001 00000010 00000011 00000100
10.0.0.255                    00001010 00000000 00000000 11111111
165.195.130.107               10100101 11000011 10000010 01101011
255.255.255.255               11111111 11111111 11111111 11111111
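
The conversion between dotted-quad and binary notation can be sketched in a few
lines of Python. This is a minimal illustration only; the function names
dotted_quad_to_int and int_to_binary_groups are hypothetical helpers introduced
here, not part of any standard API.

```python
def dotted_quad_to_int(addr: str) -> int:
    """Convert dotted-quad text such as '1.2.3.4' into a 32-bit integer."""
    fields = addr.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four fields: {addr!r}")
    value = 0
    for field in fields:
        octet = int(field)
        if not 0 <= octet <= 255:
            raise ValueError(f"field out of range [0, 255]: {field!r}")
        value = (value << 8) | octet  # each field fills one quarter (8 bits)
    return value


def int_to_binary_groups(value: int) -> str:
    """Render a 32-bit integer as four space-separated 8-bit groups."""
    bits = format(value, "032b")
    return " ".join(bits[i:i + 8] for i in range(0, 32, 8))


print(int_to_binary_groups(dotted_quad_to_int("165.195.130.107")))
# prints: 10100101 11000011 10000010 01101011
```

Treating the address as a single integer, rather than as four separate numbers,
is what makes later operations such as masking straightforward.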


In IPv6, addresses are 128 bits in length, four times larger than IPv4 addresses,
and generally speaking are less familiar to most users. The conventional notation
adopted for IPv6 addresses is a series of blocks (or fields), each written as a
four-digit hexadecimal ("hex," or base-16) number, separated by colons. An example
IPv6 address containing eight blocks would be written as
5f05:2000:80ad:5800:0058:0800:2023:1d71. Although
not as familiar to users as decimal numbers, hexadecimal numbers make the task
of converting to binary somewhat simpler. In addition, a number of agreed-upon
simplifications have been standardized for expressing IPv6 addresses [RFC4291]:

1. Leading zeros of a block need not be written. In the preceding example, the
address could have been written as 5f05:2000:80ad:5800:58:800:2023:1d71.

2. Blocks of all zeros can be omitted and replaced by the notation ::. For
example, the IPv6 address 0:0:0:0:0:0:0:1 can be written more compactly as ::1.
Similarly, the address 2001:0db8:0:0:0:0:0:2 can be written more compactly
as 2001:db8::2. To avoid ambiguities, the :: notation may be used only once
in an IPv6 address.

3. Embedded IPv4 addresses represented in the IPv6 format can use a form
of hybrid notation in which the block immediately preceding the IPv4
portion of the address has the value ffff and the remaining part of the address
is formatted using dotted-quad. For example, the IPv6 address ::ffff:10.0.0.1
represents the IPv4 address 10.0.0.1. This is called an IPv4-mapped IPv6
address.

4. A conventional notation is adopted in which the low-order 32 bits of the
IPv6 address can be written using dotted-quad notation. The IPv6 address
::0102:f001 is therefore equivalent to the address ::1.2.240.1. This is called
an IPv4-compatible IPv6 address. Note that IPv4-compatible addresses are
not the same as IPv4-mapped addresses; they are compatible only in the
sense that they can be written down or manipulated by software in a way
similar to IPv4 addresses. This type of addressing was originally required
for transition plans between IPv4 and IPv6 but is now no longer required
[RFC4291].
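
These simplifications can be checked with the ipaddress module in the Python
standard library. The sketch below is illustrative only; the variable names are
arbitrary, and the module's text output is one library's choice of formatting,
not a definition of the notation.

```python
import ipaddress

# Simplifications 1 and 2: leading zeros and runs of all-zero blocks may be dropped.
full = ipaddress.IPv6Address("5f05:2000:80ad:5800:0058:0800:2023:1d71")
print(full.compressed)      # 5f05:2000:80ad:5800:58:800:2023:1d71

loopback = ipaddress.IPv6Address("0:0:0:0:0:0:0:1")
print(loopback.compressed)  # ::1
print(loopback.exploded)    # 0000:0000:0000:0000:0000:0000:0000:0001

# Simplification 3: an IPv4-mapped address carries an IPv4 address after ffff.
mapped = ipaddress.IPv6Address("::ffff:10.0.0.1")
print(mapped.ipv4_mapped)   # 10.0.0.1

# Simplification 4: the low-order 32 bits may be written in dotted-quad form,
# so both spellings below denote the same 128-bit value.
same = ipaddress.IPv6Address("::1.2.240.1") == ipaddress.IPv6Address("::0102:f001")
print(same)                 # True
```

All of these spellings parse to the same underlying 128-bit integer; only the
textual form differs.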

Table 2-2 presents some examples of IPv6 addresses and their binary
representations.


Table 2-2 Examples of IPv6 addresses and their binary representations


Hex Notation                            Binary Representation

5f05:2000:80ad:5800:58:800:2023:1d71    0101111100000101 0010000000000000
                                        1000000010101101 0101100000000000
                                        0000000001011000 0000100000000000
                                        0010000000100011 0001110101110001

::1                                     0000000000000000 0000000000000000
                                        0000000000000000 0000000000000000
                                        0000000000000000 0000000000000000
                                        0000000000000000 0000000000000001

::1.2.240.1 or ::102:f001               0000000000000000 0000000000000000
                                        0000000000000000 0000000000000000
                                        0000000000000000 0000000000000000
                                        0000000100000010 1111000000000001


In some circumstances (e.g., when expressing a URL containing an address)
the colon delimiter in an IPv6 address may be confused with another separator
such as the colon used between an IP address and a port number. In such
circumstances, bracket characters, [ and ], are used to surround the IPv6 address. For
example, the URL


http://[2001:0db8:85a3:08d3:1319:8a2e:0370:7344]:443/


refers to port number 443 on IPv6 host 2001:0db8:85a3:08d3:1319:8a2e:0370:7344
using the HTTP/TCP/IPv6 protocols.
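
As a brief illustration of why the brackets matter, the sketch below uses
urlsplit from the Python standard library to separate the host and port of such
a URL. The helper format_endpoint is a hypothetical function written for this
example and assumes its host argument is either an IPv4 or an IPv6 literal.

```python
from urllib.parse import urlsplit

url = "http://[2001:0db8:85a3:08d3:1319:8a2e:0370:7344]:443/"
parts = urlsplit(url)
print(parts.hostname)  # 2001:0db8:85a3:08d3:1319:8a2e:0370:7344
print(parts.port)      # 443


def format_endpoint(host: str, port: int) -> str:
    """Join host and port, bracketing IPv6 literals so their colons stay unambiguous."""
    if ":" in host:  # among IP literals, only IPv6 text contains colons
        return f"[{host}]:{port}"
    return f"{host}:{port}"


print(format_endpoint("2001:db8::2", 443))  # [2001:db8::2]:443
print(format_endpoint("10.0.0.1", 80))      # 10.0.0.1:80
```

Without the brackets, a parser could not tell whether the final :443 is the
port or the last block of the address.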

The flexibility provided by [RFC4291] resulted in unnecessary confusion due
to the ability to represent the same IPv6 address in multiple ways. To remedy this
situation, [RFC5952] imposes some rules to narrow the range of options while
remaining compatible with [RFC4291]. They are as follows:

1. Leading zeros must be suppressed (e.g., 2001:0db8::0022 becomes
2001:db8::22).

2. The :: construct must be used to its maximum possible effect (most zeros
suppressed) but must not be used to replace only a single 16-bit block. If
multiple runs of zero blocks have equal length, the first is replaced with ::.

3. The hexadecimal digits a through f should be represented in lowercase.

In most cases, we too will abide by these rules.
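
The effect of these rules can be observed with a short Python sketch. It relies
on the compressed attribute of ipaddress.IPv6Address, whose output in CPython
appears to follow the three rules for ordinary addresses; that correspondence
is an assumption worth verifying for the interpreter version in use, and the
function name canonical is hypothetical.

```python
import ipaddress


def canonical(text: str) -> str:
    """Return a compact lowercase spelling of an IPv6 address."""
    return ipaddress.IPv6Address(text).compressed


# Rule 1: leading zeros are suppressed.
print(canonical("2001:0db8::0022"))       # 2001:db8::22

# Rule 2: with two equal-length zero runs, the first one becomes "::".
print(canonical("2001:db8:0:0:1:0:0:1"))  # 2001:db8::1:0:0:1

# Rule 2: a single all-zero block is left as "0", not replaced by "::".
print(canonical("2001:db8:0:1:1:1:1:1"))  # 2001:db8:0:1:1:1:1:1

# Rule 3: hexadecimal digits are written in lowercase.
print(canonical("2001:DB8::ABCD"))        # 2001:db8::abcd
```

Normalizing addresses to one spelling before comparing or logging them avoids
treating two spellings of the same address as different strings.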


2.3 Basic IP Address Structure 

IPv4 has 4,294,967,296 possible addresses in its address space, and IPv6 has
340,282,366,920,938,463,463,374,607,431,768,211,456. Because of the large number of addresses
(especially for IPv6), it is convenient to divide the address space into chunks. IP
addresses are grouped by type and size. Most of the IPv4 address chunks are
eventually subdivided down to a single address and used to identify a single network
interface of a computer attached to the Internet or to some private intranet. These
addresses are called unicast addresses. Most of the IPv4 address space is unicast
address space. Most of the IPv6 address space is not currently being used. Beyond
unicast addresses, other types of addresses include broadcast, multicast, and
anycast, which may refer to more than one interface, plus some special-purpose
addresses we will discuss later. Before we begin with the details of the current
address structure, it is useful to understand the historical evolution of IP addresses.

2.3.1 Classful Addressing 

When the Internet's address structure was originally defined, every unicast IP
address had a network portion, to identify the network on which the interface using
the IP address was to be found, and a host portion, used to identify the particular host
on the network given in the network portion. Thus, some number of contiguous bits
in the address became known as the net number, and remaining bits were known as
the host number. At the time, most hosts had only a single network interface, so the
terms interface address and host address were used somewhat interchangeably.

With the realization that different networks might have different numbers of
hosts, and that each host requires a unique IP address, a partitioning was devised
wherein different-size allocation units of IP address space could be given out to
different sites, based on their current and projected number of hosts. The
partitioning of the address space involved five classes. Each class represented a
different trade-off in the number of bits of a 32-bit IPv4 address devoted to the network
number versus the number of bits devoted to the host number. Figure 2-1 shows
the basic idea.


[Figure 2-1: bit layout, over bit positions 0 through 31, of IPv4 addresses in
classes A, B, C, D, and E]


Figure 2-1 The IPv4 address space was originally divided into five classes. Classes A, B, and C were 
used for assigning addresses to interfaces on the Internet (unicast addresses) and for 
some other special-case uses. The classes are defined by the first few bits in the address: 
0 for class A, 10 for class B, 110 for class C, and so on. Class D addresses are for multicast 
use (see Chapter 9), and class E addresses remain reserved. 


Here we see that the five classes are named A, B, C, D, and E. The A, B, and
C class spaces were used for unicast addresses. If we look more carefully at this
addressing structure, we can see how the relative sizes of the different classes and
their corresponding address ranges really work. Table 2-3 gives this class
structure (sometimes called classful addressing structure).


Table 2-3 The original ("classful") IPv4 address space partitioning


                                    High-Order                   Fraction   Number      Number
Class  Address Range                Bits        Use              of Total   of Nets     of Hosts

A      0.0.0.0-127.255.255.255      0           Unicast/special  1/2        128         16,777,216
B      128.0.0.0-191.255.255.255    10          Unicast/special  1/4        16,384      65,536
C      192.0.0.0-223.255.255.255    110         Unicast/special  1/8        2,097,152   256
D      224.0.0.0-239.255.255.255    1110        Multicast        1/16       N/A         N/A
E      240.0.0.0-255.255.255.255    1111        Reserved         1/16       N/A         N/A


The table indicates how the classful addressing structure was used primarily
as a way of allocating unicast address blocks of different sizes to users.
The partitioning into classes induces a trade-off between the number of available
network numbers of a given size and the number of hosts that can be assigned
to the given network. For example, a site allocated the class A network number
18.0.0.0 (MIT) has 2^24 possible addresses to assign as host addresses (i.e., using
IPv4 addresses in the range 18.0.0.0-18.255.255.255), but there are only 127 class A
networks available for the entire Internet. A site allocated a class C network
number, say, 192.125.3.0, would be able to assign only 256 hosts (i.e., those in the range
192.125.3.0-192.125.3.255), but there are more than two million class C network
numbers available.


Note 

These numbers are not exact. Several addresses are not generally available for
use as unicast addresses, in particular, the first and last addresses of the range
are not generally available. In our example, the site assigned address range
18.0.0.0 would really be able to assign as many as 2^24 - 2 = 16,777,214 unicast IP
addresses.
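

Because each class is identified by the first few bits of the address, the class
and its host capacity can be derived mechanically. The following Python sketch
illustrates the idea; the names address_class, HOST_BITS, and usable_hosts are
hypothetical and exist only for this example.

```python
def address_class(addr: str) -> str:
    """Return the classful category (A-E) from the high-order bits of the first field."""
    first = int(addr.split(".")[0])
    if (first & 0b10000000) == 0b00000000:
        return "A"  # leading bit 0
    if (first & 0b11000000) == 0b10000000:
        return "B"  # leading bits 10
    if (first & 0b11100000) == 0b11000000:
        return "C"  # leading bits 110
    if (first & 0b11110000) == 0b11100000:
        return "D"  # leading bits 1110 (multicast)
    return "E"      # leading bits 1111 (reserved)


# Number of host bits left over in each unicast class.
HOST_BITS = {"A": 24, "B": 16, "C": 8}


def usable_hosts(addr: str) -> int:
    """Host addresses in the classful network, minus the first and last address."""
    return 2 ** HOST_BITS[address_class(addr)] - 2


print(address_class("18.0.0.0"), usable_hosts("18.0.0.0"))        # A 16777214
print(address_class("192.125.3.0"), usable_hosts("192.125.3.0"))  # C 254
```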


The classful approach to Internet addressing lasted mostly intact for the first
decade of the Internet's growth (to about the early 1980s). After that, it began to
show its first signs of scaling problems: it was becoming too inconvenient to
centrally coordinate the allocation of a new class A, B, or C network number every time
a new network segment was added to the Internet. In addition, assigning class A
and B network numbers tended to waste too many host numbers, whereas class C
network numbers could not provide enough host numbers to many new sites.

2.3.2 Subnet Addressing 

One of the earliest difficulties encountered when the Internet began to grow was
the inconvenience of having to allocate a new network number for any new
network segment that was to be attached to the Internet. This became especially
cumbersome with the development and increasing use of local area networks
(LANs) in the early 1980s. To address the problem, it was natural to consider a
way that a site attached to the Internet could be allocated a network number
centrally that could then be subdivided locally by site administrators. If this could be
accomplished without altering the rest of the Internet's core routing
infrastructure, so much the better.

Implementing this idea would require the ability to alter the line between the
network portion of an IP address and the host portion, but only for local purposes
at a site; the rest of the Internet would "see" only the traditional class A, B, and C
partitions. The approach adopted to support this capability is called subnet
addressing [RFC0950]. Using subnet addressing, a site is allocated a class A, B, or C
network number, leaving some number of remaining host bits to be further allocated
and assigned within a site. The site may further divide the host portion of its base
address allocation into a subnetwork (subnet) number and a host number.
Essentially, subnet addressing adds one additional field to the IP address structure, but
without adding any bits to its length. As a result, a site administrator is able to
trade off the number of subnetworks versus the number of hosts expected to be on
each subnetwork without having to coordinate with other sites.

In exchange for the additional flexibility provided by subnet addressing, a 
new cost is imposed. Because the definition of the Subnet and Host fields is now 
site-specific (not dictated by the class of the network number), all routers and hosts 
at a site require a new way to determine where the Subnet field of the address and 
the Host field of the address are located within the address. Before subnets, this 
information could be derived directly by knowing whether a network number 
was from class A, B, or C (as indicated by the first few bits in the address). As an 
example, using subnet addressing, an IPv4 address might have the form shown in 
Figure 2-2. 


[Figure 2-2: a class B IPv4 address divided into a centrally allocated portion
(class bits and network number) and a portion locally managed at the site, with
the subnet/host partition falling inside the locally managed portion]


Figure 2-2 An example of a subnetted class B address. Using 8 bits for the subnet ID provides for
256 subnets with 254 hosts on each of the subnets. This partitioning may be altered by
the network administrator.

Figure 2-2 is an example of how a class B address might be "subnetted."
Assume that some site in the Internet has been allocated a class B network
number. The first 16 bits of every address the site will use are fixed at some particular
number because these bits have been allocated by a central authority. The last 16
bits (which would have been used only to create host numbers in the class B
network without subnets) can now be divided by the site network administrator as
needs may dictate. In this example, 8 bits have been chosen for the subnet number,
leaving 8 bits for host numbers. This particular configuration allows the site to
support 256 subnetworks, and each subnetwork may contain up to 254 hosts (now
the first and last addresses for each subnetwork are not available, as opposed to
losing only the first and last addresses of the entire allocated range). Recall that
the subnetwork structure is known only by hosts and routers where the
subnetting is taking place. The remainder of the Internet still treats any address
associated with the site just as it did prior to the advent of subnet addressing. Figure 2-3
shows how this works.


[Figure 2-3: a site attached to the Internet through one border router, with
two internal LAN segments; example host addresses shown are 128.32.1.14 and
128.32.2.122]


Figure 2-3 A site is allocated the classical class B network number 128.32. The network
administrator decides to apply a site-wide subnet mask of 255.255.255.0, giving 256 subnetworks
where each subnetwork can hold 256 - 2 = 254 hosts. The IPv4 address of each host on
the same subnet has the subnetwork number in common. All of the IPv4 addresses of
hosts on the left-hand LAN segment start with 128.32.1, and all of those on the right start
with 128.32.2.

This figure shows a hypothetical site attached to the Internet with one border
router (i.e., one attachment point to the Internet) and two internal local area
networks. The value of x could be anything in the range [0,255]. Each of the Ethernet
networks is an IPv4 subnetwork of the overall network number 128.32, a class B
address allocation. For other sites on the Internet to reach this site, all traffic with
destination addresses starting with 128.32 is directed by the Internet routing
system to the border router (specifically, its interface with IPv4 address 137.164.23.30).
At this point, the border router must distinguish among different subnetworks
within the 128.32 network. In particular, it must be able to distinguish and
separate traffic destined for addresses of the form 128.32.1.x from traffic destined for
addresses of the form 128.32.2.x. These represent subnetwork numbers 1 and 2,
respectively, of the 128.32 class B network number. In order to do this, the router
must be aware of where the subnet ID is to be found within the addresses. This is
accomplished by a configuration parameter we will discuss next.
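
The arithmetic of this example can be reproduced with the ipaddress module in
the Python standard library. This is only a sketch of the site described above;
the variable names are arbitrary.

```python
import ipaddress

# The site's class B allocation, written with its 16-bit network number.
site = ipaddress.ip_network("128.32.0.0/16")

# Using 8 of the 16 host bits as a subnet number yields 24-bit prefixes.
subnets = list(site.subnets(new_prefix=24))
print(len(subnets))                   # 256
print(subnets[1], subnets[2])         # 128.32.1.0/24 128.32.2.0/24

# Each subnet spans 256 addresses; the first and last are not assignable to hosts.
print(subnets[1].num_addresses - 2)   # 254

# Hosts on the two LAN segments fall into subnets 1 and 2, respectively.
print(ipaddress.ip_address("128.32.1.14") in subnets[1])   # True
print(ipaddress.ip_address("128.32.2.122") in subnets[2])  # True
```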

2.3.3 Subnet Masks 

The subnet mask is an assignment of bits used by a host or router to determine how
the network and subnetwork information is partitioned from the host information
in a corresponding IP address. Subnet masks for IP are the same length as the
corresponding IP addresses (32 bits for IPv4 and 128 bits for IPv6). They are typically
configured into a host or router in the same way as IP addresses: either statically
(typical for routers) or using some dynamic system such as the Dynamic Host
Configuration Protocol (DHCP; see Chapter 6). For IPv4, subnet masks may be written
in the same way an IPv4 address is written (i.e., dotted-decimal). Although not
originally required to be arranged in this manner, today subnet masks are
structured as some number of 1 bits followed by some number of 0 bits. Because of this
arrangement, it is possible to use a shorthand format for expressing masks that
simply gives the number of contiguous 1 bits in the mask (starting from the left).
This format is now the most common format and is sometimes called the prefix
length. Table 2-4 presents some examples for IPv4.


Table 2-4 IPv4 subnet mask examples in various formats


Dotted-Decimal     Shorthand
Representation     (Prefix Length)   Binary Representation

128.0.0.0          /1                10000000 00000000 00000000 00000000
255.0.0.0          /8                11111111 00000000 00000000 00000000
255.192.0.0        /10               11111111 11000000 00000000 00000000
255.255.0.0        /16               11111111 11111111 00000000 00000000
255.255.254.0      /23               11111111 11111111 11111110 00000000
255.255.255.224    /27               11111111 11111111 11111111 11100000
255.255.255.255    /32               11111111 11111111 11111111 11111111


Table 2-5 IPv6 subnet mask examples in various formats


                         Shorthand
Hex Notation             (Prefix Length)   Binary Representation

ffff:ffff:ffff:ffff::    /64               1111111111111111 1111111111111111
                                           1111111111111111 1111111111111111
                                           0000000000000000 0000000000000000
                                           0000000000000000 0000000000000000

ff00::                   /8                1111111100000000 0000000000000000
                                           0000000000000000 0000000000000000
                                           0000000000000000 0000000000000000
                                           0000000000000000 0000000000000000


Table 2-5 presents some examples for IPv6.
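
Converting between the prefix-length shorthand and the dotted-decimal form of an
IPv4 mask is a matter of counting contiguous 1 bits. The Python sketch below
shows one way to do it; prefix_to_mask and mask_to_prefix are hypothetical
helper names introduced for this example.

```python
def prefix_to_mask(prefix_len: int) -> str:
    """Build the dotted-decimal IPv4 mask that has prefix_len leading 1 bits."""
    if not 0 <= prefix_len <= 32:
        raise ValueError("IPv4 prefix length must be in [0, 32]")
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return ".".join(str((mask >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def mask_to_prefix(mask: str) -> int:
    """Count the leading 1 bits of a dotted-decimal mask, rejecting gaps."""
    value = 0
    for field in mask.split("."):
        value = (value << 8) | int(field)
    bits = format(value, "032b")
    if "01" in bits:  # a 1 bit after a 0 bit means the 1 bits are not contiguous
        raise ValueError(f"mask is not contiguous: {mask}")
    return bits.count("1")


print(prefix_to_mask(23))             # 255.255.254.0
print(prefix_to_mask(27))             # 255.255.255.224
print(mask_to_prefix("255.192.0.0"))  # 10
```

The contiguity check reflects the modern convention described above, in which a
mask is some number of 1 bits followed by some number of 0 bits.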

Masks are used by routers and hosts to determine where the
network/subnetwork portion of an IP address ends and the host part begins. A bit set to 1 in
the subnet mask means the corresponding bit position in an IP address should be
considered part of a combined network/subnetwork portion of an address, which
is used as the basis for forwarding datagrams (see Chapter 5). Conversely, a bit
set to 0 in the subnet mask means the corresponding bit position in an IP address
should be considered part of the host portion. For example, in Figure 2-4 we can
see how the IPv4 address 128.32.1.14 is treated when a subnet mask of 255.255.255.0
is applied to it.


Address    128.32.1.14
Mask       255.255.255.0 (/24)
Result     128.32.1.0


Figure 2-4 An IP address can be combined with a subnet mask using a bitwise AND operation in 
order to form the network/subnetwork identifier (prefix) of the address used for routing. 
In this example, applying a mask of length 24 to the IPv4 address 128.32.1.14 gives the 
prefix 128.32.1.0/24. 


Here we see how each bit in the address is ANDed with each corresponding
bit in the subnet mask. Recalling the bitwise AND operation, a result bit is only
ever a 1 if the corresponding bits in both the mask and the address are 1. In this
example, we see that the address 128.32.1.14 belongs to the subnet 128.32.1.0/24.
In Figure 2-3, this is precisely the information required by the border router to
determine to which subnetwork a datagram destined for the system with address
128.32.1.14 should be forwarded. Note again that the rest of the Internet routing
system does not require knowledge of the subnet mask because routers outside
the site make routing decisions based only on the network number portion of
an address and not the combined network/subnetwork or host portions.
Consequently, subnet masks are purely a local matter at the site.
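
The masking step can be written out directly. The following Python sketch is a
minimal illustration of the bitwise AND described above; to_int, to_dotted, and
network_prefix are hypothetical helper names, and the code assumes well-formed
dotted-decimal input.

```python
def to_int(dotted: str) -> int:
    """Pack a dotted-decimal IPv4 address or mask into a 32-bit integer."""
    value = 0
    for field in dotted.split("."):
        value = (value << 8) | int(field)
    return value


def to_dotted(value: int) -> str:
    """Unpack a 32-bit integer into dotted-decimal text."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def network_prefix(addr: str, mask: str) -> str:
    """AND the address with the mask, keeping only the network/subnetwork bits."""
    return to_dotted(to_int(addr) & to_int(mask))


print(network_prefix("128.32.1.14", "255.255.255.0"))   # 128.32.1.0
print(network_prefix("128.32.2.122", "255.255.255.0"))  # 128.32.2.0

# A router outside the site considers only the class B network number.
print(network_prefix("128.32.1.14", "255.255.0.0"))     # 128.32.0.0
```

The last line shows why the mask is a local matter: with the 16-bit mask, both
hosts reduce to the same network number, 128.32.0.0.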

2.3.4 Variable-Length Subnet Masks (VLSM) 

So far we have discussed how a network number allocated to a site can be
subdivided into ranges assigned to multiple subnetworks, each of the same size and
therefore able to support the same number of hosts, based on the operational
expectations of the network administrator. We now observe that it is possible to use a
different-length subnet mask applied to the same network number in different
portions of the same site. Although doing this complicates address configuration
management, it adds flexibility to the subnet structure because different subnetworks
may be set up with different numbers of hosts. Variable-length subnet masks (VLSM)
are now supported by most hosts, routers, and routing protocols. To understand
how VLSM works, consider the network topology illustrated in Figure 2-5, which
extends Figure 2-3 with two additional subnetworks using VLSM.


Figure 2-5 VLSM can be used to partition a network number into subnetworks with a differing
number of hosts on each subnet. Each router and host is configured with a subnet mask
in addition to its IP address. Most software supports VLSM, except for some older
routing protocols (e.g., RIP version 1).

In the more complicated and realistic example shown in Figure 2-5, three
different subnet masks are used within the site to subnet the 128.32.0.0/16 network:
/24, /25, and /26. Doing so provides for a different number of hosts on each
subnet. Recall that the number of hosts is constrained by the number of bits
remaining in the IP address that are not used by the network/subnet number. For IPv4
and a /24 prefix, this allows for 32 - 24 = 8 bits (256 hosts); for /25, half as many
(128 hosts); and for /26, half further still (64 hosts). Note that each interface on
each host and router depicted is now given both an IP address and a subnet mask,
but the mask differs across the network topology. With an appropriate dynamic
routing protocol running among the routers (e.g., OSPF, IS-IS, RIPv2), traffic is
able to flow correctly among hosts at the same site or to/from the outside of the
site across the Internet.
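
The host-count arithmetic above can be checked with a short sketch. The helper name addresses_for_prefix is hypothetical. Note that the figure counts every bit pattern in the Host field; on an ordinary subnet the all-zeros and all-ones patterns are normally not assigned to hosts, so the usable count is typically two fewer.

```python
def addresses_for_prefix(prefix_len: int, address_bits: int = 32) -> int:
    """Number of distinct values the remaining host bits can take."""
    if not 0 <= prefix_len <= address_bits:
        raise ValueError("prefix length out of range")
    return 2 ** (address_bits - prefix_len)

for length in (24, 25, 26):
    print(f"/{length}: {32 - length} host bits, {addresses_for_prefix(length)} addresses")
```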

Although it may not seem obvious, there is a common case where a subnetwork
contains only two hosts. When routers are connected together by a
point-to-point link requiring an IP address to be assigned at each end, it is common
practice to use a /31 network prefix with IPv4, and it is now also a recommended
practice to use a /127 prefix for IPv6 [RFC6164].

2.3.5 Broadcast Addresses 

In each IPv4 subnetwork, a special address is reserved to be the subnet broadcast
address. The subnet broadcast address is formed by setting the network/subnetwork
portion of an IPv4 address to the appropriate value and all the bits in the Host
field to 1. Consider the left-most subnet from Figure 2-5. Its prefix is 128.32.1.0/24.
The subnet broadcast address is constructed by inverting the subnet mask (i.e.,
changing all the 0 bits to 1 and vice versa) and performing a bitwise OR
operation with the address of any of the computers on the subnet (or, equivalently, the
network/subnetwork prefix). Recall that the result of a bitwise OR operation is 1
if either input bit is 1. Using the IPv4 address 128.32.1.14, this computation can be
written as shown in Figure 2-6.


                     0              15 16              31
Address              10000000 00100000 00000001 00001110     128.32.1.14
Complement of Mask   00000000 00000000 00000000 11111111     0.0.0.255
OR Result            10000000 00100000 00000001 11111111     128.32.1.255


Figure 2-6 The subnet broadcast address is formed by ORing the complement of the subnet mask
with the IPv4 address. In this case of a /24 subnet mask, all of the remaining 32 - 24
= 8 bits are set to 1, giving a decimal value of 255 and the subnet broadcast address of
128.32.1.255.

As shown in the figure, the subnet broadcast address for the subnet
128.32.1.0/24 is 128.32.1.255. Historically, a datagram using this type of address as
its destination has also been known as a directed broadcast. Such a broadcast can,
at least theoretically, be routed through the Internet as a single datagram until
reaching the target subnetwork, at which point it becomes a collection of
broadcast datagrams that are delivered to all hosts on the subnetwork. Generalizing
this idea further, we could form a datagram with the destination IPv4 address
128.32.255.255 and launch it into the Internet attached to the network depicted in
Figure 2-3 or Figure 2-5. This would address all hosts at the target site.
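
The computation in Figure 2-6 can be sketched as follows. The function name subnet_broadcast is hypothetical; the 0xFFFFFFFF mask is needed only because Python integers are not limited to 32 bits.

```python
import ipaddress

def subnet_broadcast(address: str, mask: str) -> str:
    """OR the address with the complement of the subnet mask."""
    addr_bits = int(ipaddress.IPv4Address(address))
    mask_bits = int(ipaddress.IPv4Address(mask))
    inverted_mask = ~mask_bits & 0xFFFFFFFF      # flip every bit, keep 32 bits
    return str(ipaddress.IPv4Address(addr_bits | inverted_mask))

print(subnet_broadcast("128.32.1.14", "255.255.255.0"))   # 128.32.1.255
print(subnet_broadcast("128.32.1.14", "255.255.0.0"))     # 128.32.255.255
```

The second call corresponds to the site-wide address 128.32.255.255 mentioned above.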


Note 

Directed broadcasts were found to be such a big problem from a security point of 
view that they are effectively disabled on the Internet today. [RFC0919] describes 
the various types of broadcasts for IPv4, and [RFC1812] suggests that support 
for forwarding directed broadcasts by routers should not only be available but 
enabled by default. This policy was reversed by [RFC2644] so that by default 
routers must now disable the forwarding of directed broadcasts and are even free 
to omit support for the capability altogether. 


In addition to the subnet broadcast address, the special-use address 
255.255.255.255 is reserved as the local net broadcast (also called limited broadcast), 
which is never forwarded by routers. (See Section 2.5 for more detail on
special-use addresses.) Note that although routers may not forward broadcasts, subnet
broadcasts and local net broadcasts destined for the same network to which a 
computer is attached should be expected to work unless explicitly disabled by 
end hosts. Such broadcasts do not require action by a router; link-layer broadcast 
mechanisms, if available, are used for supporting them (see Chapter 3). Broadcast 
addresses are typically used with protocols such as UDP/IP (Chapter 10) or ICMP 
(Chapter 8) because these protocols do not involve two-party conversations as in 
TCP/IP. IPv6 lacks any broadcast addresses; for places where broadcast addresses 
might be used in IPv4, IPv6 instead uses exclusively multicast addresses (see 
Chapter 9). 

2.3.6 IPv6 Addresses and Interface Identifiers 

In addition to being longer than IPv4 addresses by a factor of 4, IPv6 addresses
also have some additional structure. Special prefixes used with IPv6 addresses
indicate the scope of an address. The scope of an IPv6 address refers to the portion
of the network where it can be used. Important examples of scopes include
node-local (the address can be used only for communication on the same computer),
link-local (used only among nodes on the same network link or IPv6 prefix), or
global (Internet-wide). In IPv6, most nodes have more than one address in use,
often on the same network interface. Although this is supported in IPv4 as well, it
is not nearly as common. The set of addresses required in an IPv6 node, including
multicast addresses (see Section 2.5.2), is given in [RFC4291].


Note 

Another scope level called site-local using prefix fec0::/10 was originally
supported by IPv6 but was deprecated for use with unicast addressing by [RFC3879].
The primary problems include how to handle such addresses given that they may
be reused in more than one site and a lack of clarity on precisely how to define
a “site.”


Link-local IPv6 addresses (and some global IPv6 addresses) use interface
identifiers (IIDs) as a basis for unicast IPv6 address assignment. IIDs are used as the
low-order bits of an IPv6 address in all cases except where the address begins with
the binary value 000, and as such they must be unique within the same network
prefix. IIDs are ordinarily 64 bits long and are formed either directly from the
underlying link-layer MAC address of a network interface using a modified EUI-64
format [EUI64], or by another process that randomizes the value in hopes of
providing some degree of privacy against address tracking (see Chapter 6).

In IEEE standards, EUI stands for extended unique identifier. EUI-64
identifiers start with a 24-bit Organizationally Unique Identifier (OUI) followed by a 40-bit
extension identifier assigned by the organization, which is identified by the first 24
bits. The OUIs are maintained and allocated by the IEEE registration authority
[IEEERA]. EUIs may be "universally administered" or "locally administered." In
the Internet context, such addresses are typically of the universally administered
variety.

Many IEEE standards-compliant network interfaces (e.g., Ethernet) have used
shorter-format addresses (48-bit EUIs) for years. The only significant difference
between the EUI-48 and EUI-64 formats is their length (see Figure 2-7).


EUI-48:   OUI (24 bits) | Assigned by Organization (24 bits)
EUI-64:   OUI (24 bits) | Assigned by Organization (40 bits)


Figure 2-7 The EUI-48 and EUI-64 formats defined by the IEEE. These are used within IPv6 to form
interface identifiers by inverting the u bit.

The OUI is 24 bits long and occupies the first 3 bytes of both EUI-48 and
EUI-64 addresses. The low-order 2 bits of the first bytes of these addresses are
designated the u and g bits, respectively. The u bit, when set, indicates that the address
is locally administered. The g bit, when set, indicates that the address is a group or
multicast-type address. For the moment, we are concerned only with cases where
the g bit is not set.

An EUI-64 can be formed from an EUI-48 by copying the 24-bit OUI value from
the EUI-48 address to the EUI-64 address, placing the 16-bit value 1111111111111110
(hex FFFE) in the fourth and fifth bytes of the EUI-64 address, and then copying
the remaining organization-assigned bits. For example, the EUI-48 address
00-11-22-33-44-55 would become 00-11-22-FF-FE-33-44-55 in EUI-64. This mapping is the
first step used by IPv6 in constructing its interface identifiers when such
underlying EUI-48 addresses are available. The modified EUI-64 used to form IIDs for
IPv6 addresses simply inverts the u bit.
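
The two-step mapping just described (insert FFFE, then invert the u bit) can be sketched as follows. The function name is hypothetical, and the u bit is taken to be the 0x02 bit of the first byte, consistent with the description of the u and g bits above.

```python
def eui48_to_modified_eui64(eui48: str) -> str:
    """Build a modified EUI-64 interface identifier from an EUI-48 address."""
    octets = bytes.fromhex(eui48.replace(":", "").replace("-", ""))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit address")
    # Step 1: OUI, then FFFE in the fourth and fifth bytes, then the rest.
    eui64 = bytearray(octets[:3] + b"\xff\xfe" + octets[3:])
    # Step 2: invert the u bit (second-lowest bit of the first byte).
    eui64[0] ^= 0x02
    return ":".join(f"{b:02x}" for b in eui64)

print(eui48_to_modified_eui64("00-11-22-33-44-55"))   # 02:11:22:ff:fe:33:44:55
```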

When an IPv6 interface identifier is needed for a type of interface that does not 
have an EUI-48-bit address provided by its manufacturer, but has some other type 
of underlying address (e.g., AppleTalk), the underlying address is left-padded with 
zeros to form the interface identifier. Interface identifiers created for interfaces 
that lack any form of other identifier (e.g., tunnels, serial links) may be derived 
from some other interface on the same node (that is not on the same subnet) or 
from some identifier associated with the node. Lacking any other options, manual 
assignment is a last resort. 

2.3.6.1 Examples 

Using the Linux ifconfig command, we can investigate the way a link-local IPv6
address is formed: 


Linux% ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:30:48:2A:19:89
          inet addr:12.46.129.28  Bcast:12.46.129.127  Mask:255.255.255.128
          inet6 addr: fe80::230:48ff:fe2a:1989/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1359970341 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1472870787 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:4021555658 (3.7 GiB)  TX bytes:3258456176 (3.0 GiB)
          Base address:0x3040 Memory:f8220000-f8240000

Here we can see how the Ethernet's hardware address 00:30:48:2A:19:89 is
mapped to an IPv6 address. First, it is converted to EUI-64, forming the address
00:30:48:ff:fe:2a:19:89. Next, the u bit is inverted, forming the IID value
02:30:48:ff:fe:2a:19:89. To complete the link-local IPv6 address, we use
the reserved link-local prefix fe80::/10 (see Section 2.5). Together, these form
the complete address, fe80::230:48ff:fe2a:1989. The presence of /64 is the
standard length used for identifying the subnetwork/host portion of an IPv6
address derived from an IID as required by [RFC4291].

Another interesting example is from a Windows system with IPv6. In this
case, we see a special tunnel endpoint, which is used to carry IPv6 traffic through
networks that otherwise support only IPv4:

c:\> ipconfig /all 

Tunnel adapter Automatic Tunneling Pseudo-Interface:

   Connection-specific DNS Suffix  . : foo
   Description . . . . . . . . . . . : Automatic Tunneling Pseudo-Interface
   Physical Address. . . . . . . . . : 0A-99-8D-87
   Dhcp Enabled. . . . . . . . . . . : No
   IP Address. . . . . . . . . . . . : fe80::5efe:10.153.141.135%2
   Default Gateway . . . . . . . . . :
   DNS Servers . . . . . . . . . . . : fec0:0:0:ffff::1%2
                                       fec0:0:0:ffff::2%2
                                       fec0:0:0:ffff::3%2
   NetBIOS over Tcpip. . . . . . . . : Disabled


In this case, we can see a special tunneling interface called ISATAP [RFC5214].
The so-called physical address is really the hexadecimal encoding of an IPv4
address: 0A-99-8D-87 is the same as 10.153.141.135. Here, the OUI used
(00-00-5E) is the one assigned to the IANA [IANA]. It is used in combination with
the hex value fe, indicating an embedded IPv4 address. This combination is
then combined with the standard link-local prefix fe80::/10 to give the address
fe80::5efe:10.153.141.135. The %2 appended to the end of the address is called
a zone ID in Windows and indicates the interface index number on the computer
corresponding to the IPv6 address. IPv6 addresses are often created by a process
of automatic configuration, a process we discuss in more detail in Chapter 6.
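
The construction of this ISATAP address can be reproduced with a small sketch. The function name isatap_link_local is hypothetical, and the sketch covers only the 00-00-5E form of the identifier that appears in the example above.

```python
import ipaddress

def isatap_link_local(ipv4: str) -> ipaddress.IPv6Address:
    """Combine fe80::, the bytes 00-00-5E-FE, and a 32-bit IPv4 address."""
    v4_bits = int(ipaddress.IPv4Address(ipv4))
    iid = (0x00005EFE << 32) | v4_bits           # 64-bit interface identifier
    prefix = int(ipaddress.IPv6Address("fe80::"))
    return ipaddress.IPv6Address(prefix | iid)

# The "physical address" 0A-99-8D-87 is the IPv4 address in hexadecimal.
physical = ipaddress.IPv4Address(bytes.fromhex("0A998D87"))
print(physical)                           # 10.153.141.135
print(isatap_link_local(str(physical)))   # fe80::5efe:a99:8d87
```

Python prints the low-order 32 bits in hexadecimal rather than dotted-decimal form; both spellings denote the same address.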


2.4 CIDR and Aggregation 

In the early 1990s, after the adoption of subnet addressing to ease one form of
growing pains, the Internet started facing a serious set of scaling problems. Three
particular issues were considered so important as to require immediate attention:

1. By 1994, over half of all class B addresses had already been allocated. It was
expected that the class B address space would be exhausted by about 1995.

2. The 32-bit IPv4 address was thought to be inadequate to handle the size of
the Internet anticipated by the early 2000s.

3. The number of entries in the global routing table (one per network
number), about 65,000 in 1995, was growing. As more and more class A, B, and
C routing entries appeared, routing performance would suffer.

These three issues were attacked by a group in the IETF called ROAD (for 
ROuting and ADdressing), starting in 1992. They considered problems 1 and 3 to 
be of immediate concern, and problem 2 as requiring a long-term solution. The 
short-term solution they proposed was to effectively remove the class breakdown 
of IP addresses and also promote the ability to aggregate hierarchically assigned 
IP addresses. These measures would help problems 1 and 3. IPv6 was envisioned 
to deal with problem 2. 

2.4.1 Prefixes 

In order to help relieve the pressure on the availability of IPv4 addresses
(especially class B addresses), the classful addressing scheme was generalized using a
scheme similar to VLSM, and the Internet routing system was extended to support
Classless Inter-Domain Routing (CIDR) [RFC4632]. This provided a way to
conveniently allocate contiguous address ranges that contained more than 255 hosts but
fewer than 65,536. That is, something other than single class B or multiple class
C network numbers could be allocated to sites. Using CIDR, any address range
is not predefined as being part of a class but instead requires a mask similar to a
subnet mask, sometimes called a CIDR mask. CIDR masks are not limited to a site
but are instead visible to the global routing system. Thus, the core Internet routers
must be able to interpret and process masks in addition to network numbers. This
combination of numbers, called a network prefix, is used for both IPv4 and IPv6
address management.

Eliminating the predefined separation of network and host number within an
IP address makes finer-grain allocation of IP address ranges possible. As with
classful addressing, dividing the address spaces into chunks is most easily achieved by
grouping numerically contiguous addresses for use as a type or for some
particular special purpose. Such groupings are now commonly expressed using a prefix
of the address space. An n-bit prefix is a predefined value for the first n bits of an
address. The value of n (the length of the prefix) is typically expressed as an
integer in the range 0-32 for IPv4 and 0-128 for IPv6. It is generally appended to the
base IP address following a / character. Table 2-6 gives some examples of prefixes
and their corresponding IPv4 or IPv6 address ranges.

In the table, the bits defined and fixed by the prefix are enclosed in brackets.
The remaining bits may be set to any combination of 0s and 1s, thereby
covering the possible address range. Clearly, a smaller prefix length corresponds to a
larger number of possible addresses. In addition, the earlier classful addressing
approach is easily generalized by this scheme. For example, the class C network
number 192.125.3.0 can be written as the prefix 192.125.3.0/24 or 192.125.3/24.
Classful A and B network numbers can be expressed using /8 and /16 prefix
lengths, respectively.

Table 2-6 Examples of prefixes and their corresponding IPv4 or IPv6 address range

Prefix               Prefix (Binary)                          Address Range
0.0.0.0/0            00000000 00000000 00000000 00000000      0.0.0.0-255.255.255.255
128.0.0.0/1          [1]0000000 00000000 00000000 00000000    128.0.0.0-255.255.255.255
128.0.0.0/24         [10000000 00000000 00000000] 00000000    128.0.0.0-128.0.0.255
198.128.128.192/27   [11000110 10000000 10000000 110]00000    198.128.128.192-198.128.128.223
165.195.130.107/32   [10100101 11000011 10000010 01101011]    165.195.130.107
2001:db8::/32        [0010000000000001 0000110110111000]      2001:db8::-
                     0000000000000000 0000000000000000        2001:db8:ffff:ffff:ffff:ffff:ffff:ffff
                     0000000000000000 0000000000000000
                     0000000000000000 0000000000000000
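
The relationship between a prefix and its address range can be explored with Python's standard ipaddress module. This is a minimal sketch; the helper name address_range is hypothetical.

```python
import ipaddress

def address_range(prefix: str) -> tuple:
    """Return the first and last addresses covered by a prefix."""
    net = ipaddress.ip_network(prefix)
    return str(net[0]), str(net[-1])   # all free bits 0, then all free bits 1

for p in ("0.0.0.0/0", "128.0.0.0/1", "198.128.128.192/27",
          "165.195.130.107/32", "2001:db8::/32"):
    first, last = address_range(p)
    print(f"{p:20} {first} - {last}")
```

A shorter prefix leaves more free bits and therefore covers a larger range, as the table rows show.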


2.4.2 Aggregation 

Removing the classful structure of IP addresses made it possible to allocate IP 
address blocks in a wider variety of sizes. Doing so, however, did not address 
the third concern from the list of problems; it did not help to reduce the number 
of routing table entries. A routing table entry tells a router where to send traffic. 
Essentially, the router inspects the destination IP address in an arriving datagram, 
finds a matching routing table entry, and from the entry extracts the "next hop" 
for the datagram. This is somewhat like driving to a particular address in a car 
and in every intersection along the way finding a sign indicating what direction 
to take to get to the next intersection on the way to the destination. If you consider 
the number of signs that would have to be present at every intersection for every 
possible destination neighborhood, you get some sense of the problem facing the 
Internet in the early 1990s. 

At the time, few techniques were known to dramatically reduce the number 
of routing table entries while maintaining shortest-path routes to all destinations 
in the Internet. The best-known approach was published in a study of hierarchical 
routing [KK77] in the late 1970s by Kleinrock and Kamoun. They observed that if 
the network topology were arranged as a tree¹ and addresses were assigned in a
way that was "sensitive" to this topology, very small routing tables could be used 
while still maintaining shortest-path routes to all destinations. Consider Figure 2-8. 

In this figure, circles represent routers and lines represent network links
between them. The left-hand and right-hand sides of the diagram show
tree-shaped networks. The difference between them is the way addresses have been
assigned to the routers. In the left-hand (a) side, addresses are essentially
random; there is no direct relationship between the addresses and the location of
the routers in the tree. On the right-hand (b) side of the diagram, the addresses
are assigned based upon where the router is located in the tree. If we consider
the number of entries each top router requires, we see that there is a significant
difference.


1. In graph theory, a tree is a connected graph with no cycles. For a network of routers and links, this
means that there is only one simple (nonduplicative) path between any two routers.


(a) Random (Location Independent) Addressing
(b) Topology Sensitive (Location Dependent) Addressing

Figure 2-8 In a network with a tree topology, network addresses can be assigned in a special way so as to limit
the amount of routing information ("state") that needs to be stored in a router. If addresses are
not assigned in this way (left side), shortest-path routes cannot be guaranteed without storing an
amount of state proportional to the number of nodes to be reached. While assigning addresses in
a way that is sensitive to the tree topology saves state, if the network topology changes, a
reassignment of addresses is generally required.

The root (top) of the tree on the left is the router labeled 19.12.4.8. In order to
know a next hop for every possible destination, it needs an entry for all the routers
"below" it in the tree: 190.16.11.2, 86.12.0.112, 159.66.2.231, 133.17.97.12, 66.103.2.19,
18.1.1.1, 19.12.4.9, and 203.44.23.198. For any other destination, it simply routes to the
cloud labeled "Other Parts of the Network." This results in a total of nine entries.
In contrast, the root of the right-hand tree is labeled 19.0.0.1 and requires only three
entries in its routing table. Note that all of the routers on the left side of the right
tree begin with the prefix 19.1 and all to the right begin with 19.2. Thus, the table
in router 19.0.0.1 need only show 19.1.0.1 as the next hop for any destination
starting with 19.1, whereas 19.2.0.1 is the next hop for any destination starting with 19.2.
Any other destination goes to the cloud labeled "Other Parts of the Network." This
results in a total of three entries. Note that this behavior is recursive: any router
in the (b) side of the tree never requires more entries than the number of links it
has. This is a direct result of the special method used to assign the addresses. Even
if more routers are added to the (b)-side tree, this nice property is maintained.
This is the essence of the hierarchical routing idea from [KK77].
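
The three-entry table at router 19.0.0.1 can be sketched as a longest-prefix lookup. This is an illustration under stated assumptions: "starting with 19.1" and "starting with 19.2" are interpreted as the prefixes 19.1.0.0/16 and 19.2.0.0/16, the cloud is modeled as a default route, and the names ROUTES and next_hop are hypothetical.

```python
import ipaddress

# Routing table of the (b)-side root, one entry per attached link.
ROUTES = [
    (ipaddress.ip_network("19.1.0.0/16"), "19.1.0.1"),
    (ipaddress.ip_network("19.2.0.0/16"), "19.2.0.1"),
    (ipaddress.ip_network("0.0.0.0/0"), "Other Parts of the Network"),
]

def next_hop(destination: str) -> str:
    """Pick the matching entry with the longest prefix."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in ROUTES if dest in net]
    _, hop = max(matches, key=lambda entry: entry[0].prefixlen)
    return hop

print(next_hop("19.1.7.9"))        # 19.1.0.1
print(next_hop("19.2.0.44"))       # 19.2.0.1
print(next_hop("203.44.23.198"))   # Other Parts of the Network
```

The table size depends only on the number of links, not on how many routers sit below each one.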

In the Internet context, the hierarchical routing idea can be used in a specific
way to reduce the number of Internet routing entries that would be required
otherwise. This is accomplished by a procedure known as route aggregation. It works by
joining multiple numerically adjacent IP prefixes into a single shorter prefix (called
an aggregate or summary) that covers more address space. Consider Figure 2-9.


190.154.27.0/26   + 190.154.27.64/26       ->  190.154.27.0/25
190.154.27.192/26                              (not adjacent; cannot yet be aggregated)
190.154.27.192/26 + _190.154.27.128/26_    ->  190.154.27.128/25
190.154.27.0/25   + 190.154.27.128/25      ->  190.154.27.0/24
190.154.27.0/24   + _190.154.26.0/24_      ->  190.154.26.0/23


Figure 2-9 In this example, the arrows indicate aggregation of two address prefixes to form one;
the underlined prefixes (set between underscores) are additions in each step. In the
first step, 190.154.27.0/26 and 190.154.27.64/26 can be aggregated because they are
numerically adjacent, but 190.154.27.192/26 cannot. With the addition of
190.154.27.128/26, they can all be aggregated together in two steps to form
190.154.27.0/24. With the final addition of the adjacent 190.154.26.0/24, the aggregate
190.154.26.0/23 is produced.


We start with three address prefixes on the left in Figure 2-9. The first two, 
190.154.27.0/26 and 190.154.27.64/26, are numerically adjacent and can therefore 
be combined (aggregated). The arrows indicate where aggregation takes place. 
The prefix 190.154.27.192/26 cannot be aggregated in the first step because it is not 
numerically adjacent. When a new prefix, 190.154.27.128/26, is added (underlined), 
the 190.154.27.192/26 and 190.154.27.128/26 prefixes may be aggregated, forming 
the 190.154.27.128/25 prefix. This aggregate is now adjacent to the 190.154.27.0/25 
aggregate, so they can be aggregated further to form 190.154.27.0/24. When the 
prefix 190.154.26.0/24 (underlined) is added, the two class C prefixes can be
aggregated to form 190.154.26.0/23. In this way, the original three prefixes and the two
that were added can be aggregated into a single prefix. 
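
The aggregation steps above can be reproduced with the standard library function ipaddress.collapse_addresses, which merges adjacent prefixes wherever possible. The wrapper name aggregate is hypothetical; this is a sketch of the idea, not of how routers implement it.

```python
import ipaddress

def aggregate(prefixes):
    """Collapse numerically adjacent prefixes into the fewest covering prefixes."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]

prefixes = ["190.154.27.0/26", "190.154.27.64/26", "190.154.27.192/26"]
print(aggregate(prefixes))    # ['190.154.27.0/25', '190.154.27.192/26']

prefixes.append("190.154.27.128/26")
print(aggregate(prefixes))    # ['190.154.27.0/24']

prefixes.append("190.154.26.0/24")
print(aggregate(prefixes))    # ['190.154.26.0/23']
```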


2.5 Special-Use Addresses 

Both the IPv4 and IPv6 address spaces include a few address ranges that are used
for special purposes (and are therefore not used in assigning unicast addresses).
For IPv4, these addresses are given in Table 2-7 [RFC5735].

Table 2-7 IPv4 special-use addresses (defined January 2010)

Prefix               Special Use                                                   Reference
0.0.0.0/8            Hosts on the local network. May be used only as a source      [RFC1122]
                     IP address.
10.0.0.0/8           Address for private networks (intranets). Such addresses      [RFC1918]
                     never appear on the public Internet.
127.0.0.0/8          Internet host loopback addresses (same computer).             [RFC1122]
                     Typically only 127.0.0.1 is used.
169.254.0.0/16       "Link-local" addresses, used only on a single link and        [RFC3927]
                     generally assigned automatically. See Chapter 6.
172.16.0.0/12        Address for private networks (intranets). Such addresses      [RFC1918]
                     never appear on the public Internet.
192.0.0.0/24         IETF protocol assignments (IANA reserved).                    [RFC5736]
192.0.2.0/24         TEST-NET-1 addresses approved for use in documentation.       [RFC5737]
                     Such addresses never appear on the public Internet.
192.88.99.0/24       Used for 6to4 relays (anycast addresses).                     [RFC3068]
192.168.0.0/16       Address for private networks (intranets). Such addresses      [RFC1918]
                     never appear on the public Internet.
198.18.0.0/15        Used for benchmarks and performance testing.                  [RFC2544]
198.51.100.0/24      TEST-NET-2. Approved for use in documentation.                [RFC5737]
203.0.113.0/24       TEST-NET-3. Approved for use in documentation.                [RFC5737]
224.0.0.0/4          IPv4 multicast addresses (formerly class D); used only as     [RFC5771]
                     destination addresses.
240.0.0.0/4          Reserved space (formerly class E), except 255.255.255.255.    [RFC1112]
255.255.255.255/32   Local network (limited) broadcast address.                    [RFC0919]
                                                                                   [RFC0922]


In IPv6, a number of address ranges and individual addresses are used for
specific purposes. They are listed in Table 2-8 [RFC5156].

For both IPv4 and IPv6, address ranges not designated as special, multicast, or
reserved are available to be assigned for unicast use. Some unicast address space
(prefixes 10/8, 172.16/12, and 192.168/16 for IPv4 and fc00::/7 for IPv6) is reserved
for building private networks. Addresses from these ranges can be used by
cooperating hosts and routers within a site or organization, but not across the global
Internet. Thus, these addresses are sometimes called nonroutable addresses. That
is, they will not be routed by the public Internet.
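
Whether an address falls in one of these private ranges can be tested with a simple membership check. This sketch lists the four prefixes named above explicitly; the names PRIVATE_RANGES and is_private_unicast are hypothetical.

```python
import ipaddress

# The private unicast prefixes named in the text.
PRIVATE_RANGES = [ipaddress.ip_network(p) for p in
                  ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "fc00::/7")]

def is_private_unicast(address: str) -> bool:
    """True if the address lies inside one of the private ranges."""
    addr = ipaddress.ip_address(address)
    return any(addr.version == net.version and addr in net
               for net in PRIVATE_RANGES)

for a in ("10.1.2.3", "172.31.255.255", "172.32.0.1", "128.32.1.14", "fd12:3456::1"):
    print(a, is_private_unicast(a))
```

The explicit list is used instead of the library's is_private attribute because that attribute also reports other special-use ranges as private.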

The management of private, nonroutable address space is entirely a local
decision. The IPv4 private addresses are very common in home networks and for the
internal networks of moderately sized and large enterprises. They are frequently
used in combination with network address translation (NAT), which rewrites IP
addresses inside IP datagrams as they enter the Internet. We discuss NAT in detail
in Chapter 7.

Table 2-8 IPv6 special-use addresses (defined April 2008)

Prefix                 Special Use                                                   Reference
::/0                   Default route entry. Not used for addressing.                 [RFC5156]
::/128                 The unspecified address; may be used as a source IP           [RFC4291]
                       address.
::1/128                The IPv6 host loopback address; not used in datagrams         [RFC4291]
                       sent outside the local host.
::ffff:0:0/96          IPv4-mapped addresses. Such addresses never appear in         [RFC4291]
                       packet headers. For internal host use only.
::{ipv4-address}/96    IPv4-compatible addresses. Deprecated; not to be used.        [RFC4291]
2001::/32              Teredo addresses.                                             [RFC4380]
2001:10::/28           Overlay Routable Cryptographic Hash Identifiers. Such         [RFC4843]
                       addresses never appear on the public Internet.
2001:db8::/32          Address range used for documentation and for examples.        [RFC3849]
                       Such addresses never appear on the public Internet.
2002::/16              6to4 addresses of 6to4 tunnel relays.                         [RFC3056]
3ffe::/16              Used by 6bone experiments. Deprecated; not to be used.        [RFC3701]
5f00::/16              Used by 6bone experiments. Deprecated; not to be used.        [RFC3701]
fc00::/7               Unique, local unicast addresses; not used on the global       [RFC4193]
                       Internet.
fe80::/10              Link-local unicast addresses.                                 [RFC4291]
ff00::/8               IPv6 multicast addresses; used only as destination            [RFC4291]
                       addresses.


2.5.1 Addressing IPv4/IPv6 Translators 

In some networks, it may be attractive to perform translation between IPv4 and
IPv6 [RFC6127]. A framework for this has been developed for unicast translations
[RFC6144], and one is currently under development for multicast translations
[IDv4v6mc]. One of the basic functions is to provide automated, algorithmic translation
of addresses. Using the "well-known" IPv6 prefix 64:ff9b::/96 or another assigned
prefix, [RFC6052] specifies how this is accomplished for unicast addresses.

The scheme makes use of a specialized address format called an IPv4-embedded
IPv6 address. This type of address contains an IPv4 address inside an IPv6
address. It can be encoded using one of six formats, based on the length of the IPv6
prefix, which is required to be one of the following: 32, 40, 48, 56, 64, or 96. The
formats available are shown in Figure 2-10.

In the figure, the prefix is either the well-known prefix or a prefix unique to
the organization deploying translators. Bits 64-71 must be set to 0 to maintain
compatibility with identifiers specified in [RFC4291]. The suffix bits are reserved
and should be set to 0. The method to produce an IPv4-embedded IPv6 address
is then simple: concatenate the IPv6 prefix with the 32-bit IPv4 address,
ensuring that the bits 64-71 are set to 0 (inserting if necessary). Append the suffix as
0 bits until a 128-bit address is produced. IPv4-embedded IPv6 addresses using
the 96-bit prefix option may be expressed using the convention for IPv4-mapped
addresses mentioned previously (Section 2.2(3) of [RFC4291]). For example,
embedding the IPv4 address 198.51.100.16 with the well-known prefix produces
the address 64:ff9b::198.51.100.16.

Prefix
Length   Fields, in order from bit 0 to bit 127 (u occupies bits 64-71)
32       IPv6 Prefix (32 bits) | IPv4 Address (32 bits) | u | Suffix (56 bits)
40       IPv6 Prefix (40 bits) | IPv4 Address (first 24 bits) | u | IPv4 Address (last 8 bits) | Suffix (48 bits)
48       IPv6 Prefix (48 bits) | IPv4 Address (first 16 bits) | u | IPv4 Address (last 16 bits) | Suffix (40 bits)
56       IPv6 Prefix (56 bits) | IPv4 Address (first 8 bits) | u | IPv4 Address (last 24 bits) | Suffix (32 bits)
64       IPv6 Prefix (64 bits) | u | IPv4 Address (32 bits) | Suffix (24 bits)
96       IPv6 Prefix (96 bits) | IPv4 Address (32 bits)

Figure 2-10 IPv4 addresses can be embedded within IPv6 addresses, forming an IPv4-embedded
IPv6 address. Six different formats are available, depending on the IPv6 prefix length in
use. The well-known prefix 64:ff9b::/96 can be used for automatic translation between
IPv4 and IPv6 unicast addresses.
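
The embedding procedure can be sketched as a byte-level copy that skips the octet holding bits 64-71. This is an illustrative reading of the method described above, not a reference implementation; the function name embed_ipv4 is hypothetical, and 2001:db8:122:344::/64 is a documentation prefix chosen only as an example.

```python
import ipaddress

def embed_ipv4(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Place an IPv4 address after the IPv6 prefix, keeping bits 64-71 zero."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen not in (32, 40, 48, 56, 64, 96):
        raise ValueError("prefix length must be 32, 40, 48, 56, 64, or 96")
    out = bytearray(net.network_address.packed)   # suffix bits start as 0
    pos = net.prefixlen // 8
    for octet in ipaddress.IPv4Address(ipv4).packed:
        if pos == 8:          # byte 8 holds bits 64-71 and must stay 0
            pos += 1
        out[pos] = octet
        pos += 1
    return ipaddress.IPv6Address(bytes(out))

print(embed_ipv4("64:ff9b::/96", "198.51.100.16"))         # 64:ff9b::c633:6410
print(embed_ipv4("2001:db8:122:344::/64", "192.0.2.33"))   # 2001:db8:122:344:c0:2:2100:0
```

The first result is the same address as 64:ff9b::198.51.100.16, printed with the last 32 bits in hexadecimal.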

2.5.2 Multicast Addresses 

Multicast addressing is supported by IPv4 and IPv6. An IP multicast address (also 
called group or group address) identifies a group of host interfaces, rather than a 
single one. Generally speaking, the group could span the entire Internet. The 
portion of the network that a single group covers is known as the group's scope 
[RFC2365]. Common scopes include node-local (same computer), link-local (same 
subnet), site-local (applicable to some site), global (entire Internet), and
administrative. Administrative scoped addresses may be used in an area of the network that
has been manually configured into routers. A site administrator may configure 
routers as admin-scope boundaries, meaning that multicast traffic of the associated 
group is not forwarded past the router. Note that the site-local and administrative 
scopes are available for use only with multicast addressing. 

Under software control, the protocol stack in each Internet host is able to join 
or leave a multicast group. When a host sends something to a group, it creates a 
datagram using one of its own (unicast) IP addresses as the source address and 
a multicast IP address as the destination. All hosts in scope that have joined the
group should receive any datagrams sent to the group. The sender is not generally
aware of the hosts receiving the datagram unless they explicitly reply. Indeed, the 
sender does not even know in general how many hosts are receiving its datagrams. 
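
As a rough illustration of joining and leaving a group "under software control," the following Python sketch uses the sockets API. The helper names are hypothetical, and the code assumes a platform whose socket module exposes IP_ADD_MEMBERSHIP and IP_DROP_MEMBERSHIP (common on Linux, BSD, and Windows). Nothing is sent on the network until join_group or leave_group is called on a real socket.

```python
import socket


def membership_request(group: str, interface: str = "0.0.0.0") -> bytes:
    """Pack the 8-byte request: group address followed by the local interface address.

    The interface value 0.0.0.0 asks the stack to pick a default interface.
    """
    packed_group = socket.inet_aton(group)
    if packed_group[0] >> 4 != 0xE:              # class D addresses start with bits 1110
        raise ValueError(f"{group} is not an IPv4 multicast address")
    return packed_group + socket.inet_aton(interface)


def join_group(sock: socket.socket, group: str, interface: str = "0.0.0.0") -> None:
    """Ask the protocol stack to join the group on behalf of this socket."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, interface))


def leave_group(sock: socket.socket, group: str, interface: str = "0.0.0.0") -> None:
    """Ask the protocol stack to leave the group."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP,
                    membership_request(group, interface))
```

Note that the receiver names only the group address, never a sender, when it joins.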

The original multicast service model, described so far, has become known as 
any-source multicast (ASM). In this model, any sender may send to any group; a
receiver joins the group by specifying only the group address. A newer approach,
called source-specific multicast (SSM) [RFC3569][RFC4607], uses only a single sender
per group (also see the errata to [RFC4607]). In this case, when joining a group,
a host specifies the address of a channel, which comprises both a group address
and a source IP address. SSM was developed to avoid some of the complexities in
deploying the ASM model. Although neither form of multicast is widely available
throughout the Internet, it seems that SSM is now the more likely candidate for
adoption.

Understanding and implementing wide area multicasting has been an ongoing
effort within the Internet community for more than a decade, and a large
number of protocols have been developed to support it. Full details of how global
Internet multicasting works are therefore beyond the scope of this text, but the
interested reader is directed to [IMR02]. Details of how local IP multicast operates
are given in Chapter 9. For now, we shall discuss the format and meaning of IPv4
and IPv6 multicast addresses.

2.5.3 IPv4 Multicast Addresses 

For IPv4, the class D space (224.0.0.0-239.255.255.255) has been reserved for 
supporting multicast. With 28 bits free, this provides for the possibility of 2^28 =
268,435,456 host groups (each host group is an IP address). This address space is 
divided into major sections based on the way they are allocated and handled with 
respect to routing [IP4MA]. Those major sections are presented in Table 2-9. 

The blocks of addresses up to 224.255.255.255 are allocated for the exclusive 
use of certain application protocols or organizations. These are allocated as the 
result of action by the IANA or by the IETF. The local network control block is
limited to the local network of the sender; datagrams sent to those addresses are 
never forwarded by multicast routers. The All Hosts group (224.0.0.1) is one group 
in this block. The internetwork control block is similar to the local network control 
range but is intended for control traffic that needs to be routed off the local link. 
An example from this block is the Network Time Protocol (NTP) multicast group 
(224.0.1.1) [RFC5905]. 

The first ad hoc block was constructed to hold addresses that did not fall into
either the local or internetwork control blocks. Most of the allocations in this range
are for commercial services, some of which do not (or never will) require global
address allocations; they may eventually be returned in favor of GLOP addressing
(see footnote 2 and the next paragraphs). The SDP/SAP block contains addresses used by
applications such as the session directory tool (SDR) [H96] that send multicast
session announcements using the Session Announcement Protocol (SAP) [RFC2974].
Originally a component of SAP, the newer Session Description Protocol (SDP)
[RFC4566] is now used not only with IP multicast but also with other mechanisms
to describe multimedia sessions.


2. GLOP is not an acronym but instead simply a name for a portion of address space.


Table 2-9 Major sections of IPv4 class D address space used for supporting multicast

Range (Inclusive)             Special Use                                    Reference

224.0.0.0-224.0.0.255         Local network control; not forwarded           [RFC5771]
224.0.1.0-224.0.1.255         Internetwork control; forwarded normally       [RFC5771]
224.0.2.0-224.0.255.255       Ad hoc block I                                 [RFC5771]
224.1.0.0-224.1.255.255       Reserved                                       [RFC5771]
224.2.0.0-224.2.255.255       SDP/SAP                                        [RFC4566]
224.3.0.0-224.4.255.255       Ad hoc block II                                [RFC5771]
224.5.0.0-224.255.255.255     Reserved                                       [IP4MA]
225.0.0.0-231.255.255.255     Reserved                                       [IP4MA]
232.0.0.0-232.255.255.255     Source-specific multicast (SSM)                [RFC4607][RFC4608]
233.0.0.0-233.251.255.255     GLOP                                           [RFC3180]
233.252.0.0-233.255.255.255   Ad hoc block III                               [RFC5771]
                              (233.252.0.0/24 is reserved for documentation)
234.0.0.0-234.255.255.255     Unicast-prefix-based IPv4 multicast addresses  [RFC6034]
235.0.0.0-238.255.255.255     Reserved                                       [IP4MA]
239.0.0.0-239.255.255.255     Administrative scope                           [RFC2365]

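
The block boundaries of Table 2-9 lend themselves to a simple lookup. The sketch below transcribes the table rows into a list and reports which section a given group address falls in; the names BLOCKS and classify are our own, chosen only for illustration.

```python
import ipaddress

# (first address, last address, special use), transcribed from Table 2-9.
BLOCKS = [
    ("224.0.0.0", "224.0.0.255", "Local network control"),
    ("224.0.1.0", "224.0.1.255", "Internetwork control"),
    ("224.0.2.0", "224.0.255.255", "Ad hoc block I"),
    ("224.1.0.0", "224.1.255.255", "Reserved"),
    ("224.2.0.0", "224.2.255.255", "SDP/SAP"),
    ("224.3.0.0", "224.4.255.255", "Ad hoc block II"),
    ("224.5.0.0", "224.255.255.255", "Reserved"),
    ("225.0.0.0", "231.255.255.255", "Reserved"),
    ("232.0.0.0", "232.255.255.255", "Source-specific multicast (SSM)"),
    ("233.0.0.0", "233.251.255.255", "GLOP"),
    ("233.252.0.0", "233.255.255.255", "Ad hoc block III"),
    ("234.0.0.0", "234.255.255.255", "Unicast-prefix-based IPv4 multicast addresses"),
    ("235.0.0.0", "238.255.255.255", "Reserved"),
    ("239.0.0.0", "239.255.255.255", "Administrative scope"),
]


def classify(address: str) -> str:
    """Return the Table 2-9 section containing an IPv4 multicast address."""
    addr = ipaddress.IPv4Address(address)
    if not addr.is_multicast:
        raise ValueError(f"{address} is not in the class D space")
    for first, last, use in BLOCKS:
        if ipaddress.IPv4Address(first) <= addr <= ipaddress.IPv4Address(last):
            return use
    raise LookupError("no matching row")  # unreachable if the rows cover 224/4


if __name__ == "__main__":
    for example in ("224.0.0.1", "224.0.1.1", "239.255.0.1"):
        print(example, "->", classify(example))
```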

The other major address blocks were created somewhat later in the evolution of 
IP multicast. The SSM block is used by applications employing SSM in combination 
with their own unicast source IP address in forming SSM channels, as described 
previously. In the GLOP block, multicast addresses are based on the autonomous 
system (AS) number of the host on which the application allocating the address 
resides. AS numbers are used by Internet-wide routing protocols among ISPs in 
order to aggregate routes and apply routing policies. Each such AS has a unique 
AS number. Originally, AS numbers were 16 bits but have now been extended to 
32 bits [RFC4893]. GLOP addresses are generated by placing a 16-bit AS number in 
the second and third bytes of the IPv4 multicast address, leaving room for 1 byte to 
represent the possible multicast addresses (i.e., up to 256 addresses). Thus, it is
possible to map back and forth between a 16-bit AS number and the GLOP multicast
address range associated with an AS number. Although this computation is simple,
several online calculators have been developed to do it, too (see footnote 3).
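
The mapping is small enough to sketch directly. In the Python fragment below, the function names are hypothetical, and the upper bound on the AS number is our own choice: it simply keeps results inside the GLOP range listed in Table 2-9 (233.0.0.0-233.251.255.255).

```python
import ipaddress

# Largest AS number whose block stays inside 233.0.0.0-233.251.255.255
# (251 in the second byte, 255 in the third); an assumption based on Table 2-9.
MAX_GLOP_ASN = (251 << 8) | 255


def glop_block(asn: int) -> ipaddress.IPv4Network:
    """Return the /24 of GLOP addresses associated with a 16-bit AS number."""
    if not 0 <= asn <= MAX_GLOP_ASN:
        raise ValueError("AS number does not map into the GLOP range")
    # 233 in the first byte, the AS number in the second and third bytes.
    return ipaddress.IPv4Network(((233 << 24) | (asn << 8), 24))


def glop_asn(address: str) -> int:
    """Recover the AS number from a GLOP multicast address."""
    value = int(ipaddress.IPv4Address(address))
    asn = (value >> 8) & 0xFFFF
    if value >> 24 != 233 or asn > MAX_GLOP_ASN:
        raise ValueError(f"{address} is not a GLOP address")
    return asn


if __name__ == "__main__":
    print(glop_block(5662))
    print(glop_asn("233.22.30.7"))
```

For AS 5662 (0x161e, i.e., bytes 22 and 30), the associated block is 233.22.30.0/24.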


3. For example, http://gigapop.uoregon.edu/glop/.

The most recent of the IPv4 multicast address allocation mechanisms associates
a number of multicast addresses with an IPv4 unicast address prefix. This is called
unicast-prefix-based multicast addressing (UBM) and is described in [RFC6034]. It is
based on a similar structure developed earlier for IPv6 that we discuss in Section
2.5.4. The UBM IPv4 address range is 234.0.0.0 through 234.255.255.255. A unicast
address allocation with a /24 or shorter prefix may make use of UBM addresses.
Allocations with fewer addresses (i.e., a /25 or longer prefix) must use some other
mechanism. UBM addresses are constructed as a concatenation of the 234/8 prefix,
the allocated unicast prefix, and the multicast group ID. Figure 2-11 shows the
format.


0            7 8                              N                        31
| 234 (8 bits) | Unicast Prefix (up to 24 bits) | Group ID (up to 16 bits) |


Figure 2-11 The IPv4 UBM address format. For unicast address allocations of /24 or shorter,
associated multicast addresses are allocated based on a concatenation of the prefix 234/8, the
assigned unicast prefix, and the multicast group ID. Allocations with shorter unicast
prefixes therefore contain more unicast and multicast addresses.


To determine the set of UBM addresses associated with a unicast allocation, 
the allocated prefix is simply prepended with the 234/8 prefix. For example, the 
unicast IPv4 address prefix 192.0.2.0/24 has a single associated UBM address, 
234.192.0.2. It is also possible to determine the owner of a multicast address by 
simply "left-shifting" the multicast address by 8 bit positions. We know that the 
multicast address range 234.128.32.0/24 is allocated to UC Berkeley, for example, 
because the corresponding unicast IPv4 address space 128.32.0.0/16 (the "left- 
shifted" version of 234.128.32.0) is owned by UC Berkeley (as can be determined 
using a WHOIS query; see Section 2.6.1.1). 
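
Both directions of this mapping can be sketched with integer shifts. The function names below are ours, not part of any standard API; the reverse mapping needs the unicast prefix length as an input because the multicast address alone does not record it.

```python
import ipaddress


def ubm_block(unicast_prefix: str) -> ipaddress.IPv4Network:
    """Return the UBM multicast block associated with a unicast prefix."""
    net = ipaddress.IPv4Network(unicast_prefix)
    if net.prefixlen > 24:
        raise ValueError("UBM requires a /24 or shorter unicast prefix")
    # Shift the prefix right by 8 bit positions and place 234 in the first byte.
    base = (234 << 24) | (int(net.network_address) >> 8)
    return ipaddress.IPv4Network((base, net.prefixlen + 8))


def ubm_owner(multicast: str, unicast_prefixlen: int) -> ipaddress.IPv4Network:
    """'Left-shift' a UBM address by 8 bits to recover the unicast prefix."""
    value = int(ipaddress.IPv4Address(multicast))
    if value >> 24 != 234:
        raise ValueError(f"{multicast} is not in 234/8")
    shifted = (value << 8) & 0xFFFFFFFF          # drop the leading 234
    return ipaddress.IPv4Network((shifted, unicast_prefixlen), strict=False)


if __name__ == "__main__":
    print(ubm_block("192.0.2.0/24"))
    print(ubm_block("128.32.0.0/16"))
    print(ubm_owner("234.128.32.0", 16))
```

A /24 yields exactly one multicast address (a /32), whereas the /16 in the UC Berkeley example yields a /24 of multicast addresses.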

UBM addresses may offer advantages over the other types of multicast
address allocations. For example, they do not carry the 16-bit restriction for AS
numbers used by GLOP addressing. In addition, they are allocated as a consequence
of already-existing unicast address space allocations. Thus, sites wishing
to use multicast addresses already know which addresses they can use without
further coordination. Finally, UBM addresses are allocated at a finer granularity
than GLOP addresses, which correspond to AS number allocations. In today's
Internet, a single AS number may be associated with multiple sites, frustrating the
simple mapping between address and owner supported by UBM.

The administratively scoped address block can be used to limit the distribution
of multicast traffic to a particular collection of routers and hosts. These are
the multicast analogs of private unicast IP addresses. Such addresses should not
be used for distributing multicast into the Internet, as most of them are blocked at
enterprise boundaries. Large sites sometimes subdivide administratively scoped
multicast addresses to cover specific useful scopes (e.g., work group, division, and
geographical area).


2.5.4 IPv6 Multicast Addresses 

For IPv6, which is considerably more aggressive in its use of multicast, the prefix 
ff00::/8 has been reserved for multicast addresses, and 112 bits are available for 
holding the group number, providing for the possibility of 

2^112 = 5,192,296,858,534,827,628,530,496,329,220,096

groups. Its general format is as shown in Figure 2-12. 


| 11111111 | Flags (4 bits) | Scope (4 bits) | Group ID (112 bits) |
  bits 0-7   bits 8-11        bits 12-15       bits 16-127

Figure 2-12 The base IPv6 multicast address format includes 4 flag bits (0, reserved; R, contains
rendezvous point; P, uses unicast prefix; T, is transient). The 4-bit Scope value indicates the
scope of the multicast (global, local, etc.). The Group ID is encoded in the low-order 112
bits. If the P or R bit is set, an alternative format is used.


The second byte of the IPv6 multicast address includes a 4-bit Flags field and a 
4-bit Scope ID field in the second nibble. The Scope field is used to indicate a limit 
on the distribution of datagrams addressed to certain multicast addresses. The 
hexadecimal values 0, 3, and f are reserved. The hex values 6, 7, and 9 through d 
are unassigned. The values are given in Table 2-10, which is based on Section 2.7 
of [RFC4291]. 


Table 2-10 Values of the IPv6 Scope field

Value   Scope

0       Reserved
1       Interface-/machine-local
2       Link-/subnet-local
3       Reserved
4       Admin
5       Site-local
6-7     Unassigned
8       Organizational-local
9-d     Unassigned
e       Global
f       Reserved
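
To make the layout of the second byte concrete, the following sketch splits it into the Flags and Scope nibbles and looks the scope up in Table 2-10. The names SCOPES and decode_multicast are hypothetical; the flag order within the nibble (0, R, P, T from high to low bit) follows the caption of Figure 2-12.

```python
import ipaddress

# Assigned scope values from Table 2-10; 0, 3, and f are reserved.
SCOPES = {
    0x1: "Interface-/machine-local",
    0x2: "Link-/subnet-local",
    0x4: "Admin",
    0x5: "Site-local",
    0x8: "Organizational-local",
    0xE: "Global",
}
RESERVED_SCOPES = {0x0, 0x3, 0xF}


def decode_multicast(address: str) -> dict:
    """Split the second byte of an IPv6 multicast address into flags and scope."""
    addr = ipaddress.IPv6Address(address)
    if not addr.is_multicast:                    # first byte must be 11111111
        raise ValueError(f"{address} is not an IPv6 multicast address")
    second = addr.packed[1]
    flags, scope = second >> 4, second & 0xF
    if scope in SCOPES:
        name = SCOPES[scope]
    elif scope in RESERVED_SCOPES:
        name = "Reserved"
    else:
        name = "Unassigned"
    return {
        "R": bool(flags & 0x4),
        "P": bool(flags & 0x2),
        "T": bool(flags & 0x1),
        "scope": scope,
        "scope_name": name,
    }


if __name__ == "__main__":
    print(decode_multicast("ff02::1"))
    print(decode_multicast("ff3e::8000:1"))
```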
Many IPv6 multicast addresses allocated by the IANA for permanent use
intentionally span multiple scopes. Each of these is defined with a certain offset
relative to every scope (such addresses are called scope-relative or variable-scope
for this reason). For example, the variable-scope multicast address ff0x::101 is
reserved for NTP servers by [IP6MA]. The x indicates variable scope; Table 2-11
shows some of the addresses defined by this reservation.


Table 2-11 Example permanent variable-scope IPv6 multicast address reservations for NTP (101)

Address      Meaning

ff01::101    All NTP servers on the same machine
ff02::101    All NTP servers on the same link/subnet
ff04::101    All NTP servers within some administratively defined scope
ff05::101    All NTP servers at the same site
ff08::101    All NTP servers at the same organization
ff0e::101    All NTP servers in the Internet
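
The rows of Table 2-11 all follow one pattern: a fixed offset (0x101 for NTP) combined with a scope nibble. A minimal sketch, with a hypothetical function name, is:

```python
import ipaddress


def variable_scope_address(offset: int, scope: int) -> ipaddress.IPv6Address:
    """Form ff0x::<offset> for a permanent (all flags 0) variable-scope group."""
    if not 0 <= scope <= 0xF:
        raise ValueError("scope must fit in 4 bits")
    if not 0 <= offset < 2 ** 112:
        raise ValueError("offset must fit in the 112-bit group ID")
    # High 16 bits are 11111111, flags 0000, then the scope nibble.
    return ipaddress.IPv6Address(((0xFF00 | scope) << 112) | offset)


if __name__ == "__main__":
    for scope in (0x1, 0x2, 0x4, 0x5, 0x8, 0xE):
        print(variable_scope_address(0x101, scope))
```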


In IPv6, the multicast address format given in Figure 2-12 is used when the
P and R bit fields are set to 0. When P is set to 1, two alternative methods exist
for multicast addresses that do not require global agreement on a per-group basis.
These are described in [RFC3306] and [RFC4489]. In the first, called
unicast-prefix-based IPv6 multicast address assignment, a unicast prefix allocation provided by an
ISP or address allocation authority also effectively allocates a collection of multicast
addresses, thereby limiting the amount of global coordination required for avoiding
duplicates. With the second method, link-scoped IPv6 multicast, interface identifiers
(IIDs) are used, and multicast addresses are based on a host's IID. To understand
how these various formats work, we need to first understand the use of the bit
fields in the IPv6 multicast address in more detail. They are defined in Table 2-12.


Table 2-12 IPv6 multicast address flags

Bit Field
(Flag)      Meaning                                                        Reference

R           Rendezvous point flag (0, regular; 1, RP address included)     [RFC3956]
P           Prefix flag (0, regular; 1, address based on unicast prefix)   [RFC3306]
T           Transient flag (0, permanently assigned; 1, transient)         [RFC4291]


The T bit field, when set, indicates that the included group address is temporary
or dynamically allocated; it is not one of the standard addresses defined in
[IP6MA]. When the P bit field is set to 1, the T bit must also be set to 1. When this
happens, a special format of IPv6 multicast addresses based on unicast address
prefixes is enabled, as shown in Figure 2-13.


| 11111111 | Flags = 0011 (4 bits) | Scope (4 bits) | Reserved = 00000000 (8 bits) |
| Prefix Length (8 bits; 0 for SSM) | Prefix (64 bits; 0 for SSM) | Group ID (32 bits) |

Figure 2-13 IPv6 multicast addresses can be created based upon unicast IPv6 address assignments
[RFC3306]. When this is done, the P bit field is set to 1, and the unicast prefix is carried
in the address, along with a 32-bit group ID. This form of multicast address allocation
eases the need for global address allocation agreements.


We can see here how using unicast-prefix-based addressing changes the format
of the multicast address to include space for a unicast prefix and its length,
plus a smaller (32-bit) group ID. The purpose of this scheme is to provide a way
of allocating globally unique IPv6 multicast addresses without requiring a new
global mechanism for doing so. Because IPv6 unicast addresses are already allocated
globally in units of prefixes (see Section 2.6), it is possible to use bits of this
prefix in multicast addresses, thereby leveraging the existing method of unicast
address allocation for multicast use. For example, an organization receiving a unicast
prefix allocation of 3ffe:ffff:1::/48 would also consequently receive a
unicast-based multicast prefix allocation of ff3x:30:3ffe:ffff:1::/96, where x is any valid
scope. SSM is also supported using this format by setting the prefix length and
prefix fields to 0, effectively requiring the prefix ff3x::/32 (where x is any valid
scope value) for use in all such IPv6 SSM multicast addresses.
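
The field packing of Figure 2-13 can be sketched as a sequence of shifts. The function name below is hypothetical, and the SSM case is modeled, as an assumption of this sketch, by passing the zero-length prefix ::/0 so that both the Prefix Length and Prefix fields become 0.

```python
import ipaddress


def prefix_based_group(unicast_prefix: str, scope: int, group_id: int) -> ipaddress.IPv6Address:
    """Build a unicast-prefix-based IPv6 multicast address (flags P=1, T=1)."""
    net = ipaddress.IPv6Network(unicast_prefix)
    if net.prefixlen > 64:
        raise ValueError("the Prefix field holds at most 64 bits")
    if not 0 <= scope <= 0xF:
        raise ValueError("scope must fit in 4 bits")
    if not 0 <= group_id < 2 ** 32:
        raise ValueError("group ID must fit in 32 bits")
    prefix64 = int(net.network_address) >> 64    # upper 64 bits of the unicast prefix
    value = (
        (0xFF << 120)                            # 11111111
        | (0x3 << 116)                           # flags 0011: P and T set
        | (scope << 112)
        | (net.prefixlen << 96)                  # 8 reserved bits above this stay 0
        | (prefix64 << 32)
        | group_id
    )
    return ipaddress.IPv6Address(value)


if __name__ == "__main__":
    print(prefix_based_group("3ffe:ffff:1::/48", 0xE, 0))
    print(prefix_based_group("::/0", 0xE, 0x80000001))
```

With the 48-bit prefix from the text and global scope, the result lies in ff3e:30:3ffe:ffff:1::/96, where 0x30 = 48 is the prefix length.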

To create unique multicast addresses of link-local scope, a method based on 
IIDs can be used [RFC4489], which is preferred to unicast-prefix-based allocation 
when only link-local scope is required. In this case, another form of IPv6 multicast 
address structure is used (see Figure 2-14). 


| 11111111 | Flags = 0011 (4 bits) | Scope, at most 2 (4 bits) | Reserved = 00000000 (8 bits) |
| Prefix Length = 11111111 (8 bits) | IID (64 bits) | Group ID (32 bits) |

Figure 2-14 The IPv6 link-scoped multicast address format. Applicable only to link- (or smaller)
scoped addresses, the multicast address can be formed by combining an IPv6 interface
ID and a group ID. The mapping is straightforward, and all such addresses use prefixes
of the form ff3x:00ff::/32, where x is the scope ID and is less than 3.


The address format shown in Figure 2-14 is very similar to the format in Figure
2-13, except that the Prefix Length field is set to 255, and instead of a prefix
being carried in the subsequent field, an IPv6 IID is carried. The advantage of
this structure over the previous one is that no prefix need be supplied in forming
the multicast address. In ad hoc networks where no routers may be available, an
individual machine can form unique multicast addresses based on its own IID
without having to engage in a complex agreement protocol. As stated before, this
format works only for link- or node-local multicast scoping, however. When larger
scopes are required, either unicast-prefix-based addressing or permanent multicast
addresses are used. As an example of this format, a host with IID
02-11-22-33-44-55-66-77 would use multicast addresses of the form
ff3x:00ff:0211:2233:4455:6677:gggg:gggg, where x is a scope value of 2 or less and
gggg:gggg is the hexadecimal notation for a 32-bit multicast group ID. (Because the
Prefix Length field holds 255, the second 16-bit block of the address is 00ff.)
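
A sketch of the link-scoped construction follows. The function name is hypothetical, the Prefix Length field is filled with 255 (0xff) as the text describes, and restricting the scope to 1 or 2 is our own reading of "2 or less" (scope 0 being reserved in Table 2-10).

```python
import ipaddress


def link_scoped_group(iid: bytes, group_id: int, scope: int = 2) -> ipaddress.IPv6Address:
    """Build a link-scoped IPv6 multicast address from a 64-bit IID."""
    if len(iid) != 8:
        raise ValueError("the IID must be exactly 8 bytes")
    if scope not in (1, 2):                      # node-local or link-local only
        raise ValueError("this format is limited to link-local or smaller scope")
    if not 0 <= group_id < 2 ** 32:
        raise ValueError("group ID must fit in 32 bits")
    value = (
        (0xFF << 120)                            # 11111111
        | (0x3 << 116)                           # flags 0011: P and T set
        | (scope << 112)
        | (0xFF << 96)                           # Prefix Length field = 255
        | (int.from_bytes(iid, "big") << 32)     # IID in place of a unicast prefix
        | group_id
    )
    return ipaddress.IPv6Address(value)


if __name__ == "__main__":
    host_iid = bytes.fromhex("0211223344556677")
    print(link_scoped_group(host_iid, 0x12345678))
```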

The bit field we have yet to discuss is the R bit field. It is used when
unicast-prefix-based multicast addressing is used (the P bit is set) along with a multicast
routing protocol that requires knowledge of a rendezvous point.


Note 

A rendezvous point (RP) is the IP address of a router set up to handle multicast
routing for one or more multicast groups. RPs are used by the PIM-SM protocol
[RFC4601] to help senders and receivers participating in the same multicast
group to find each other. One of the problems encountered in deploying
Internet-wide multicast has been locating rendezvous points. This scheme overloads the
IPv6 multicast address to include an RP address. Therefore, it is simple to find an
RP from a group address by just selecting the appropriate subset of bits.


When the R bit is set, the modified format for a multicast address shown in
Figure 2-15 is used.


| 11111111 | Flags = 0111 (4 bits) | Scope (4 bits) | Reserved = 0000 (4 bits) | RIID (4 bits) |
| Prefix Length (8 bits; greater than 0, at most 64) | Prefix (64 bits) | Group ID (32 bits) |


Figure 2-15 The unicast IPv6 address of an RP can be embedded inside an IPv6 multicast address
[RFC3956]. Doing so makes it straightforward to find an RP associated with an address
for routing purposes. An RP is used by the multicast routing system in order to coordinate
multicast senders with receivers when they are not on the same subnetwork.


The format shown in Figure 2-15 is similar to the one shown in Figure 2-13,
but SSM is not used (so the prefix length cannot be zero). In addition, a new 4-bit
field called the RIID is introduced. To form the IPv6 address of an RP based on
a multicast address of the form in Figure 2-15, the number of bits indicated in
the Prefix Length field are extracted from the Prefix field and placed as the upper
bits in a fresh IPv6 address. Then, the contents of the RIID field are used as the
low-order 4 bits of the RP address. The rest is filled with zeros. As an example,
consider a multicast address ff75:940:2001:db8:dead:beef:f00d:face. In this case,
the scope is 5 (site-local), the RIID field has the value 9, and the prefix length is
0x40 = 64 bits. The prefix itself is therefore 2001:db8:dead:beef, so the RP address
is 2001:db8:dead:beef::9. More examples are given in [RFC3956].
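
The extraction steps just described can be sketched as follows. The function name embedded_rp is hypothetical, and the field positions are taken from the format in Figure 2-15; the sketch checks only the conditions mentioned in the text (R, P, and T set; prefix length between 1 and 64).

```python
import ipaddress


def embedded_rp(group: str) -> ipaddress.IPv6Address:
    """Recover the RP address embedded in an IPv6 multicast group address."""
    value = int(ipaddress.IPv6Address(group))
    if value >> 120 != 0xFF:
        raise ValueError(f"{group} is not an IPv6 multicast address")
    flags = (value >> 116) & 0xF
    if flags & 0x7 != 0x7:                       # R, P, and T must all be 1
        raise ValueError("address does not use the embedded-RP format")
    riid = (value >> 104) & 0xF
    prefix_len = (value >> 96) & 0xFF
    if not 0 < prefix_len <= 64:
        raise ValueError("prefix length must be between 1 and 64")
    prefix_field = (value >> 32) & (2 ** 64 - 1)
    # Keep only the first prefix_len bits of the 64-bit Prefix field.
    kept = (prefix_field >> (64 - prefix_len)) << (64 - prefix_len)
    # Prefix in the upper bits, RIID in the low-order 4 bits, zeros between.
    return ipaddress.IPv6Address((kept << 64) | riid)


if __name__ == "__main__":
    print(embedded_rp("ff75:940:2001:db8:dead:beef:f00d:face"))
```

Applied to the example above, the function returns 2001:db8:dead:beef::9.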

As with IPv4, there are a number of reserved IPv6 multicast addresses. These
addresses are grouped by scope, except for the variable-scope addresses mentioned
before. Table 2-13 gives a list of the major reservations from the IPv6 multicast
space. Consult [IP6MA] for additional information.


Table 2-13 Reserved addresses within the IPv6 multicast address space

Address              Scope       Special Use                        Reference

ff01::1              Node        All nodes                          [RFC4291]
ff01::2              Node        All routers                        [RFC4291]
ff01::fb             Node        mDNSv6                             [IDChes]
ff02::1              Link        All nodes                          [RFC4291]
ff02::2              Link        All routers                        [RFC4291]
ff02::4              Link        DVMRP routers                      [RFC1075]
ff02::5              Link        OSPFIGP                            [RFC2328]
ff02::6              Link        OSPFIGP designated routers         [RFC2328]
ff02::9              Link        RIPng routers                      [RFC2080]
ff02::a              Link        EIGRP routers                      [EIGRP]
ff02::d              Link        PIM routers                        [RFC5059]
ff02::16             Link        MLDv2-capable routers              [RFC3810]
ff02::6a             Link        All snoopers                       [RFC4286]
ff02::6d             Link        LL-MANET-routers                   [RFC5498]
ff02::fb             Link        mDNSv6                             [IDChes]
ff02::1:2            Link        All DHCP agents                    [RFC3315]
ff02::1:3            Link        LLMNR                              [RFC4795]
ff02::1:ffxx:xxxx    Link        Solicited-node address range       [RFC4291]
ff05::2              Site        All routers                        [RFC4291]
ff05::fb             Site        mDNSv6                             [IDChes]
ff05::1:3            Site        All DHCP servers                   [RFC3315]
ff0x::               Variable    Reserved                           [RFC4291]
ff0x::fb             Variable    mDNSv6                             [IDChes]
ff0x::101            Variable    NTP                                [RFC5905]
ff0x::133            Variable    Aggregate Server Access Protocol   [RFC5352]
ff0x::18c            Variable    All ACs address (CAPWAP)           [RFC5415]
ff3x::/32            (Special)   SSM block                          [RFC4607]


2.5.5 Anycast Addresses

An anycast address is a unicast IPv4 or IPv6 address that identifies a different host
depending on where in the network it is used. This is accomplished by configuring
Internet routers to advertise the same unicast routes from multiple locations in
the Internet. Thus, an anycast address refers not to a single host in the Internet, but
to the "most appropriate" or "closest" single host that is responding to the anycast
address. Anycast addressing is used most frequently for finding a computer that
provides a common service [RFC4786]. For example, a datagram sent to an anycast
address could be used to find a DNS server (see Chapter 11), a 6to4 gateway that
encapsulates IPv6 traffic in IPv4 tunnels [RFC3068], or RPs for multicast routing
[RFC4610].

2.6 Allocation 

IP address space is allocated, usually in large chunks, by a collection of hierarchically
organized authorities. The authorities are generally organizations that allocate
address space to various owners, usually ISPs or other smaller authorities.
Authorities are most often involved in allocating portions of the global unicast
address space, but other types of addresses (multicast and special-use) are also
sometimes allocated. The authorities can make allocations to users for an undetermined
amount of time, or for a limited time (e.g., for running experiments). The
top of the hierarchy is the IANA [IANA], which has wide-ranging responsibility
for allocating IP addresses and other types of numbers used in the Internet
protocols.

2.6.1 Unicast 

For unicast IPv4 and IPv6 address space, the IANA delegates much of its allocation
authority to a few regional Internet registries (RIRs). The RIRs coordinate with each
other through an organization formed in 2003 called the Number Resource Organization
(NRO) [NRO]. At the time of writing (mid-2011), the set of RIRs includes
those shown in Table 2-14, all of which participate in the NRO. Note in addition
that, as of early 2011, all the remaining unicast IPv4 address space held by IANA
for allocation had been handed over to these RIRs.

These entities typically deal with relatively large address blocks [IP4AS]
[IP6AS]. They allocate address space to smaller registries operating in countries
(e.g., Australia and Singapore) and to large ISPs. ISPs, in turn, provide address
space to their customers and themselves. When users sign up for Internet service,
they are ordinarily provided a (typically small) fraction or range of their
ISP's address space in the form of an address prefix. These address ranges are
owned and managed by the customer's ISP and are called provider-aggregatable
(PA) addresses because they consist of one or more prefixes that can be aggregated
with other prefixes the ISP owns. Such addresses are also sometimes called
nonportable addresses. Switching providers typically requires customers to change the
IP prefixes on all computers and routers they have that are attached to the Internet
(an often unpleasant operation called renumbering).


Table 2-14 Regional Internet registries that participate in the NRO

RIR Name                                   Area of Responsibility       Reference

AfriNIC (African Network                   Africa                       http://www.afrinic.net
Information Center)
APNIC (Asia Pacific Network                Asia/Pacific Area            http://www.apnic.net
Information Center)
ARIN (American Registry for                North America                http://www.arin.net
Internet Numbers)
LACNIC (Regional Latin America and         Latin America and some       http://lacnic.net/en/index.html
Caribbean IP Address Registry)             Caribbean islands
RIPE NCC (Réseaux IP Européens)            Europe, Middle East,         http://www.ripe.net
                                           Central Asia

An alternative type of address space is called provider-independent (PI) address 
space. Addresses allocated from PI space are allocated to the user directly and 
may be used with any ISP. However, because such addresses are owned by the 
customer, they are not numerically adjacent to the ISP's own addresses and are 
therefore not aggregatable. An ISP being asked to provide routing for a customer's 
PI addresses may require additional payment for service or simply not agree to 
support such a configuration. In some sense, an ISP that agrees to provide routing 
for a customer's PI addresses is taking on an extra cost relative to other customers 
by having to increase the size of its routing tables. On the other hand, many sites 
prefer to use PI addresses, and might be willing to pay extra for them, because 
it helps to avoid the need to renumber when switching ISPs (avoiding what has 
become known as provider lock). 
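
The routing-table cost of PI space can be illustrated with Python's ipaddress.collapse_addresses, which merges adjacent or contained prefixes. The prefixes below are documentation ranges chosen only for this sketch, and the function name is hypothetical; real ISPs apply routing policy that this toy model ignores.

```python
import ipaddress


def routes_to_announce(prefixes):
    """Aggregate prefixes into the smallest equivalent list of routes."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return list(ipaddress.collapse_addresses(networks))


# An ISP block and three PA customer assignments carved out of it.
ISP_AND_PA_CUSTOMERS = [
    "198.51.100.0/24",      # the ISP's own block
    "198.51.100.0/26",      # customer A (PA)
    "198.51.100.64/26",     # customer B (PA)
    "198.51.100.128/25",    # customer C (PA)
]
# A customer arriving with its own PI prefix, not adjacent to the ISP's space.
PI_CUSTOMER = "203.0.113.0/24"

if __name__ == "__main__":
    print(routes_to_announce(ISP_AND_PA_CUSTOMERS))
    print(routes_to_announce(ISP_AND_PA_CUSTOMERS + [PI_CUSTOMER]))
```

All of the PA assignments collapse into the ISP's single /24, whereas the PI prefix remains a separate entry that must be carried on its own.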

2.6.1.1 Examples 

It is possible to use the Internet WHOIS service to determine how address space
has been allocated. For example, we can form a query for information about the
IPv4 address 72.1.140.203 by accessing the corresponding URL
http://whois.arin.net/rest/ip/72.1.140.203.txt:


NetRange:       72.1.140.192 - 72.1.140.223
CIDR:           72.1.140.192/27
OriginAS:
NetName:        SPEK-SEA5-PART-1
NetHandle:      NET-72-1-140-192-1
Parent:         NET-72-1-128-0-1
NetType:        Reassigned
RegDate:        2005-06-29
Updated:        2005-06-29
Ref:            http://whois.arin.net/rest/net/NET-72-1-140-192-1


Here we see that the address 72.1.140.203 is really part of the network called
SPEK-SEA5-PART-1, which has been allocated the address range 72.1.140.192/27.
Furthermore, we can see that SPEK-SEA5-PART-1's address range is a portion of
the PA address space called NET-72-1-128-0-1. We can formulate a query for
information about this network by visiting the URL
http://whois.arin.net/rest/net/NET-72-1-128-0-1.txt:


NetRange:       72.1.128.0 - 72.1.191.255
CIDR:           72.1.128.0/18
OriginAS:
NetName:        SPEAKEASY-6
NetHandle:      NET-72-1-128-0-1
Parent:         NET-72-0-0-0-0
NetType:        Direct Allocation
RegDate:        2004-09-09
Updated:        2009-05-19
Ref:            http://whois.arin.net/rest/net/NET-72-1-128-0-1


This record indicates that the address range 72.1.128.0/18 (called by the "handle"
or name NET-72-1-128-0-1) has been directly allocated out of the address
range 72.0.0.0/8 managed by ARIN. More details on data formats and the various
methods ARIN supports for WHOIS queries can be found at [WRWS].
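
The nesting of these allocations can be checked offline once the CIDR values have been read from the WHOIS records. The sketch below does no network access; the function name is hypothetical, and the extra RIPE block in the usage example is included only to show that unrelated blocks are filtered out.

```python
import ipaddress


def allocation_chain(address: str, blocks):
    """Return the blocks that contain address, most specific first.

    Any two networks that contain the same address are nested, so sorting
    by prefix length yields the chain from leaf assignment up to the parent.
    """
    addr = ipaddress.ip_address(address)
    networks = [ipaddress.ip_network(b) for b in blocks]
    containing = [n for n in networks if addr in n]
    return sorted(containing, key=lambda n: n.prefixlen, reverse=True)


if __name__ == "__main__":
    known_blocks = ["72.0.0.0/8", "72.1.140.192/27", "72.1.128.0/18", "193.5.88.0/21"]
    for net in allocation_chain("72.1.140.203", known_blocks):
        print(net)
```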

We can look at a different type of result using one of the other RIRs. For example,
if we search for information regarding the IPv4 address 193.5.93.80 using
the Web query interface at http://www.ripe.net/whois, we obtain the following
result:


% This is the RIPE Database query service.
% The objects are in RPSL format.

% The RIPE Database is subject to Terms and Conditions.
% See http://www.ripe.net/db/support/db-terms-conditions.pdf

% Note: This output has been filtered.
% To receive output for a database update, use the "-B" flag.

% Information related to '193.5.88.0 - 193.5.95.255'

inetnum:        193.5.88.0 - 193.5.95.255
netname:        WIPONET
descr:          World Intellectual Property Organization
descr:          UN Specialized Agency
descr:          Geneva
country:        CH
admin-c:        AM4504-RIPE
tech-c:         AM4504-RIPE
status:         ASSIGNED PI
mnt-by:         CH-UNISOURCE-MNT
mnt-by:         DE-COLT-MNT
source:         RIPE # Filtered


Here, we can see that the address 193.5.93.80 is a portion of the 193.5.88.0/21
block allocated to WIPO. Note that the status of this block is ASSIGNED PI, meaning
that this particular block of addresses is of the provider-independent variety.
The reference to RPSL indicates that the database records are in the Routing Policy
Specification Language [RFC2622][RFC4012], used by ISPs to express their routing
policies. Such information allows network operators to configure routers to help
minimize Internet routing instabilities.
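
WHOIS records often give a block as a first-last range rather than in prefix notation. Converting between the two is mechanical; the sketch below uses ipaddress.summarize_address_range from the Python standard library, with a hypothetical wrapper name and the assumption that the range is written as "first - last".

```python
import ipaddress


def range_to_cidrs(text: str):
    """Convert a 'first - last' IPv4 range into the prefixes that cover it exactly."""
    first_text, last_text = text.split("-")      # dotted-decimal contains no hyphens
    first = ipaddress.ip_address(first_text.strip())
    last = ipaddress.ip_address(last_text.strip())
    return list(ipaddress.summarize_address_range(first, last))


if __name__ == "__main__":
    print(range_to_cidrs("193.5.88.0 - 193.5.95.255"))
    print(range_to_cidrs("72.1.140.192 - 72.1.140.223"))
```

The RIPE range corresponds to the single prefix 193.5.88.0/21, and the ARIN NetRange from the earlier record corresponds to 72.1.140.192/27. A range that is not aligned on a prefix boundary would produce more than one prefix.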

2.6.2 Multicast 

In IPv4 and IPv6, multicast addresses (i.e., group addresses) can be described based 
on their scope, the way they are determined (statically, dynamically by agreement, 
or algorithmically), and whether they are used for ASM or SSM. Guidelines have 
been constructed for allocation of these groups ([RFC5771] for IPv4; [RFC3307] for 
IPv6) and the overall architecture is detailed in [RFC6308]. The groups that are 
not of global scope (e.g., administratively scoped addresses and IPv6 link-scoped 
multicast addresses) can be reused in various parts of the Internet and are either 
configured by a network administrator out of an administratively scoped address 
block or selected automatically by end hosts. Globally scoped addresses that are 
statically allocated are generally fixed and may be hard-coded into applications. 
This type of address space is limited, especially in IPv4, so such addresses are 
really intended for uses applicable to any Internet site. Algorithmically determined
globally scoped addresses can be created based on AS numbers, as in
GLOP, or an associated unicast prefix allocation. Note that SSM can use globally 
scoped addresses (i.e., from the SSM block), administratively scoped addresses, or 
unicast-prefix-based IPv6 addresses where the prefix is effectively zero. 

As we can see from the relatively large number of protocols and the complexity
of the various multicast address formats, multicast address management is a
formidable issue (not to mention global multicast routing [RFC5110]). From a typical
user's point of view, multicasting is used rarely and may be of limited concern.
From a programmer's point of view, it may be worthwhile to support multicast
in application designs, and some insight has been provided into how to do so
[RFC3170]. For network administrators faced with implementing multicast, some
interaction with the service provider is likely necessary. In addition, some guidelines
for multicast address allocation have been developed by vendors [GGEMA].


2.7 Unicast Address Assignment 

Once a site has been allocated a range of unicast IP addresses, typically from its 
ISP, the site or network administrator must determine how to assign addresses in 
the address range to each network interface and how to set up the subnet structure. 
If the site has only a single physical network segment (e.g., most private homes), 
this process is relatively straightforward. For larger enterprises, especially those
receiving service from multiple ISPs and that use multiple physical network segments
distributed over a large geographical area, this process can be complicated.
We shall begin to see how this works by looking at the case where a home user
uses a private address range and a single IPv4 address provided by an ISP. This is
a common scenario today. We then move on to provide some introductory guidance
for more complicated situations.

2.7.1 Single Provider/No Network/Single Address 

The simplest type of Internet service that can be obtained today is to receive a single 
IP address (typically IPv4 only in the United States) from an ISP to be used with a 
single computer. For services such as DSL, the single address might be assigned as 
the end of a point-to-point link and might be temporary. For example, if a user's 
computer connects to the Internet over DSL, it might be assigned the address 
63.204.134.177 on a particular day. Any running program on the computer may send 
and receive Internet traffic, and any such traffic will carry the source IPv4 address 
63.204.134.177. Even a host this simple has other active IP addresses as well. These 
include the local "loopback" address (127.0.0.1) and some multicast addresses, including,
at a minimum, the All Hosts multicast address (224.0.0.1). If the host is running
IPv6, at a minimum it is using the All Nodes IPv6 multicast address (ff02::1), any
IPv6 addresses it has been assigned by the ISP, the IPv6 loopback address (::1), and a 
link-local address for each network interface configured for IPv6 use. 

To see a host's active multicast addresses (groups) on Linux, we can use the
ifconfig and netstat commands to see the IP addresses and groups in use:


Linux% ifconfig ppp0
ppp0      Link encap:Point-to-Point Protocol
          inet addr:71.141.244.213  P-t-P:71.141.255.254  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1492  Metric:1
          RX packets:33134 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41031 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:3
          RX bytes:17748984 (16.9 MiB)  TX bytes:9272209 (8.8 MiB)

Linux% netstat -gn
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
lo              1      224.0.0.1
ppp0            1      224.0.0.251
ppp0            1      224.0.0.1
lo              1      ff02::1

Here we see that the point-to-point link associated with the device ppp0
has been assigned the IPv4 address 71.141.244.213; no IPv6 address has been
assigned. The host system does have IPv6 enabled, however, so when we inspect
its group memberships we see that it is subscribed to the IPv6 All Nodes multicast
group on its local loopback (lo) interface. We can also see that the IPv4 All Hosts
group is in use, in addition to the mDNS (multicast DNS) service [IDChes]. The
mDNS protocol uses the static IPv4 multicast address 224.0.0.251.
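The addresses seen in this output fall into a few coarse categories, which can be checked mechanically. The following is a minimal sketch using Python's standard ipaddress module; the function name classify and its three labels are illustrative choices, not terminology from the text.

```python
import ipaddress

def classify(addr: str) -> str:
    """Return a coarse label for an IPv4 or IPv6 address string."""
    ip = ipaddress.ip_address(addr)
    if ip.is_loopback:      # 127.0.0.0/8 for IPv4, ::1 for IPv6
        return "loopback"
    if ip.is_multicast:     # 224.0.0.0/4 for IPv4, ff00::/8 for IPv6
        return "multicast"
    return "unicast"

if __name__ == "__main__":
    # Addresses taken from the example host discussed above.
    for a in ["71.141.244.213", "127.0.0.1", "224.0.0.1",
              "224.0.0.251", "ff02::1", "::1"]:
        version = ipaddress.ip_address(a).version
        print(f"{a:>15}  IPv{version}  {classify(a)}")
```

Running the script prints one line per address, showing that even this simple host uses unicast, loopback, and multicast addresses at the same time.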

2.7.2 Single Provider/Single Network/Single Address 

Many Internet users who own more than one computer find that having only a
single computer attached to the Internet is not an ideal situation. As a result, they
have home LAN or WLAN networks and use either a router or a computer acting
as a router to provide connectivity to the Internet. Such configurations are very
similar to the single-computer case, except the router forwards packets from the
home network to the ISP and also performs NAT (see Chapter 7; also called Internet
Connection Sharing (ICS) in Windows) by rewriting the IP addresses in packets
being exchanged with the customer's ISP. From the ISP's point of view, only a
single IP address has been used. Today, much of this activity is automated, so the
need for manual address configuration is minimal. The routers provide automatic
address assignment to the home clients using DHCP. They also handle address
assignment for the link set up with the ISP if necessary. Details of DHCP operation
and host configuration are given in Chapter 6.

2.7.3 Single Provider/Multiple Networks/Multiple Addresses 

Many organizations find that the allocation of a single unicast address, especially
if it is only temporarily assigned, is insufficient for their Internet access needs.
In particular, organizations intending to run Internet servers (such as Web sites)
generally wish to have an IP address that does not change over time. These sites
also often have multiple LANs; some of them are internal (separated from the
Internet by firewalls and NAT devices), and others may be external (providing
services to the Internet). For such networks, there is typically a site or network
administrator who must decide how many IP addresses the site requires, how
to structure subnets at the site, and which subnets should be internal and which
external. The arrangement shown in Figure 2-16 is typical for small and medium-size
enterprises.

Figure 2-16 A typical small to medium-size enterprise network. The site has been allocated 64
public (routable) IPv4 addresses in the range 128.32.2.64/26. A "DMZ" network holds
servers that are visible to the Internet. The internal router provides Internet access for
computers internal to the enterprise using NAT.


In this figure, a site has been allocated the prefix 128.32.2.64/26, providing
up to 64 (minus 2) routable IPv4 addresses. The "DMZ" network ("demilitarized
zone" network, outside the primary firewall; see Chapter 7) is used to attach servers
that can be accessed by users on the Internet. Such computers typically provide
Web access, login servers, and other services. These servers are assigned IP
addresses from a small subset of the prefix range; many sites have only a few
public servers. The remaining addresses from the site prefix are given to the NAT
router as the basis for a "NAT pool" (see Chapter 7). This router can rewrite datagrams
entering and leaving the internal network using any of the addresses in
its pool. The network setup in Figure 2-16 is convenient for two primary reasons.
First, the separation of the internal network from the DMZ helps protect internal
computers from damage should the DMZ servers be compromised. In addition,
this setup partitions the IP address assignment. Once the border router, DMZ, and
internal NAT router have been set up, any address structure can be used internally,
where many (private) IP addresses are available. Of course, this example
is only one way of setting up small enterprise networks, and other factors such
as cost might ultimately drive the way routers, networks, and IP addresses are
deployed for any particular small or medium-size enterprise.
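The arithmetic behind "64 (minus 2)" addresses can be checked with a short script. This is a minimal sketch using Python's standard ipaddress module. The site prefix comes from the text, but the choice of a /29 for the DMZ is purely a hypothetical sizing used for illustration; the text says only that the DMZ servers take a small subset of the range.

```python
import ipaddress

# Site prefix taken from the text.
SITE = ipaddress.ip_network("128.32.2.64/26")

def describe(net):
    """Return the address count and usable host range of an IPv4 prefix."""
    hosts = list(net.hosts())  # omits the network and broadcast addresses
    return {
        "total": net.num_addresses,
        "usable": len(hosts),
        "first_host": str(hosts[0]),
        "last_host": str(hosts[-1]),
        "broadcast": str(net.broadcast_address),
    }

def split_site(site, dmz_prefixlen=29):
    """Carve the first subnet of the given length off as a DMZ (hypothetical
    sizing) and return it with the remaining blocks, e.g. for a NAT pool."""
    dmz = next(site.subnets(new_prefix=dmz_prefixlen))
    nat_pool = sorted(site.address_exclude(dmz))
    return dmz, nat_pool

if __name__ == "__main__":
    print(describe(SITE))
    dmz, nat_pool = split_site(SITE)
    print("DMZ:", dmz)
    print("NAT pool blocks:", [str(n) for n in nat_pool])
```

With these assumptions, the /26 yields 64 addresses of which 62 are usable by hosts, and setting aside a /29 for the DMZ leaves 56 addresses in three blocks for the NAT pool.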

2.7.4 Multiple Providers/Multiple Networks/Multiple Addresses (Multihoming) 

Some organizations that depend on Internet access for their continued operations
attach to the Internet using more than one provider (called multihoming) in order
to provide for redundancy in case of failure, or for other reasons. Because of CIDR,
organizations with a single ISP tend to have PA IP addresses associated with that
ISP. If they obtain a second ISP, the question arises as to what IP addresses should
be used in each of the hosts. Some guidance has been developed for operating
with multiple ISPs, or when transitioning from one to another (which raises some
similar concerns). For IPv4, [RFC4116] discusses how either PI or PA addresses can
be used for multihoming. Consider the situation shown in Figure 2-17.



Figure 2-17 Provider-aggregatable and provider-independent IPv4 addresses used in a hypothetical 
multihomed enterprise. Site operators tend to prefer using PI space if it is available. ISPs 
prefer PA space because it promotes prefix aggregation and reduces routing table size. 


Here, a (somewhat) fictitious site S has two ISPs, P1 and P2. If it uses PA address
space from P1's block (12.46.129.0/25), it advertises this prefix at points C and D to
P1 and P2, respectively. The prefix can be aggregated by P1 into its 12/8 block in
advertisements to the rest of the Internet at point A, but P2 is not able to aggregate
it at point B because it is not numerically adjacent to its own prefix (137.164/16).
In addition, from the point of view of some host in the other parts of the Internet,
traffic for 12.46.129.0/25 tends to go through ISP P2 rather than ISP P1 because the
prefix for site S is longer ("more specific") than when it goes through P1. This is
a consequence of the way the longest matching prefix algorithm works for Internet
routing (see Chapter 5 for more details). In essence, a host in the other parts of the
Internet could reach the address 12.46.129.1 via either a matching prefix 12.0.0.0/8
at point A or the prefix 12.46.129.0/25 at point B. Because each prefix matches (i.e.,
contains a common set of prefix bits with the destination address 12.46.129.1), the
one with the larger or longer mask (larger number of matching bits) is preferred,
which in this case is P2. Thus, P2 is in the position of being unable to aggregate the
prefix from S and also winds up carrying most of S's traffic.
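The preference for the more specific prefix can be illustrated with a small script. This is a minimal sketch of longest-prefix matching over a two-entry table built from the prefixes in the example; the table layout, the labels, and the function name longest_prefix_match are assumptions made for illustration, and real routers use far more efficient data structures than a linear scan.

```python
import ipaddress

# The two advertisements seen by a remote host in the example above.
ROUTES = [
    (ipaddress.ip_network("12.0.0.0/8"), "point A (via P1)"),
    (ipaddress.ip_network("12.46.129.0/25"), "point B (via P2)"),
]

def longest_prefix_match(dest, routes):
    """Return the (prefix, label) entry with the longest prefix containing
    dest, or None if no prefix matches."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, label) for net, label in routes if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda entry: entry[0].prefixlen)

if __name__ == "__main__":
    for dest in ["12.46.129.1", "12.1.2.3", "8.8.8.8"]:
        print(dest, "->", longest_prefix_match(dest, ROUTES))
```

Both prefixes contain 12.46.129.1, but the /25 wins because it has more matching bits, so the traffic is drawn toward P2. An address such as 12.1.2.3 matches only the aggregate and goes toward P1.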

If site S decides to use PI space instead of PA space, the situation is more symmetric.
However, no aggregation is possible. In this case, the PI prefix 198.134.135.0/24
is advertised to P1 and P2 at points C and D, respectively, but neither ISP is able
to aggregate it because it is not numerically adjacent to either of the ISPs' address
blocks. Thus, both ISPs advertise the identical prefix 198.134.135.0/24 at points A
and B. In this fashion the "natural" shortest-path computations in Internet routing
can take place, and site S can be reached by whichever ISP is closer to the host
sending to it. In addition, if site S decides to switch ISPs, it does not have to change
its assigned addresses. Unfortunately, the inability to aggregate such addresses
can be a concern for future scalability of the Internet, so PI space is in relatively
short supply.

Multihoming for IPv6 has been the subject of study within the IETF for
some time, resulting in the Multi6 architecture [RFC4177] and the Shim6 protocol
[RFC5533]. Multi6 outlines a number of approaches that have been proposed
for handling the issue. Broadly, the options mentioned include using a routing
approach equivalent to IPv4 multihoming mentioned previously, using the capabilities
of Mobile IPv6 [RFC6275], and creating a new method that splits the identification
of nodes away from their locators. Today, IP addresses serve as both
identifiers (essentially a form of name) and locators (an address understood by the
routing system) for a network interface attached to the Internet. Providing a separation
would allow the network protocol implementation to function even if the
underlying IP address changes. Protocols that provide this separation are sometimes
called identifier/locator separating or id/loc split protocols.

Shim6 introduces a "shim" network-layer protocol that separates the "upper-layer
protocol identifier" used by the transport protocols from the IP address.
Multihoming is achieved by selecting which IP address (locator) to use based
on dynamic network conditions and without requiring PI address allocations.
Communicating hosts (peers) agree on which locators to use and when to switch
between them. Separation of identifiers from locators is the subject of several other
efforts, including the experimental Host Identity Protocol (HIP) [RFC4423], which
identifies hosts using cryptographic host identifiers. Such identifiers are effectively
the public keys of public/private key pairs associated with hosts, so HIP
traffic can be authenticated as having come from a particular host. Security issues
are discussed in more detail in Chapter 18.


2.8 Attacks Involving IP Addresses 

Given that IP addresses are essentially numbers, few network attacks involve only
them. Generally, attacks can be carried out when sending "spoofed" datagrams (see
Chapter 5) or with other related activities. That said, IP addresses are now being
used to help identify individuals suspected of undesirable activities (e.g., copyright
infringement in peer-to-peer networks or distribution of illegal materials). Doing
this can be misleading for several reasons. For example, in many circumstances 
IP addresses are only temporary and are reassigned to different users at different 
times. Therefore, any errors in accurate timekeeping can easily cause databases 
that map IP addresses to users to be incorrect. Furthermore, access controls are not 
widely and securely deployed; it is often possible to attach to the Internet through 
some public access point or some unintentionally open wireless router in someone's
home or office. In such circumstances, the unsuspecting home or business
owner may be targeted based on IP address even though that person was not the 
originator of traffic on the network. This can also happen when compromised hosts 
are used to form botnets. Such collections of computers (and routers) can now be 
leased on what has effectively become an Internet-based black market for carrying 
out attacks, serving illicit content, and other misdeeds [RFC4948]. 


2.9 Summary 

The IP address is used to identify and locate network interfaces on devices 
throughout the Internet system (unicast addresses). It may also be used for identifying
more than one such interface (multicast, broadcast, or anycast addresses).
Each interface has a minimum of one 32-bit IPv4 address (when IPv4 is being 
used) and usually has several 128-bit addresses if using IPv6. Unicast addresses 
are allocated in blocks by a hierarchically structured set of administrative entities. 
Prefixes allocated by such entities represent a chunk of unicast IP address space 
typically given to ISPs that in turn provide addresses to their users. Such prefixes 
are usually a subrange of the ISP's address block (called provider-aggregatable or 
PA addresses) but may instead be owned by the user (called provider-independent
or PI addresses). Numerically adjacent address prefixes (PA addresses) can
be aggregated to save routing table space and improve scalability of the Internet. 
This approach arose when the Internet's "classful" network structure consisting
of class A, B, and C network numbers was abandoned in favor of classless
inter-domain routing (CIDR). CIDR allows for different sizes of address blocks to 
be assigned to organizations with different needs for address space; essentially, 
CIDR enables more efficient allocation of address space. Anycast addresses are 
unicast addresses that refer to different hosts depending on where the sender is 
located; such addresses are often used for discovering network services that may 
be present in multiple locations. 

IPv6 unicast addresses differ somewhat from IPv4 addresses. Most important, 
IPv6 addresses have a scope concept, for both unicast and multicast addresses, 
that specifically indicates where an address is valid. Typical scopes include node-local,
link-local, and global. Link-local addresses are often created based on a standard
prefix in combination with an IID that can be based on addresses provided
by lower-layer protocols (such as hardware/MAC addresses) or random values.
This approach aids in autoconfiguration of IPv6 addresses.

Both IPv4 and IPv6 support addressing formats that refer to more than one
network interface at a time. Broadcast and multicast addresses are supported in
IPv4, but only multicast addresses are supported in IPv6. Broadcast allows for
one-to-all communication, whereas multicast allows for one-to-many communication.
Senders send to multicast groups (IP addresses) that act somewhat like television
channels; the sender has no direct knowledge of the recipients of its traffic or
how many receivers there are on a channel. Global multicast in the Internet has
evolved over more than a decade and involves many protocols: some for routing,
some for address allocation and coordination, and some for signaling that a host
wishes to join or leave a group. There are also many types and uses of IP multicast
addresses, both in IPv4 and (especially) in IPv6. Variants of the IPv6 multicast
address format provide ways for allocating groups based on unicast prefixes,
embedding routing information (RP addresses) in groups, and creating multicast
addresses based on IIDs.

The development and deployment of CIDR was arguably the last fundamental
change made to the Internet's core routing system. CIDR was successful in
handling the pressure to have more flexibility in allocating address space and
for promoting routing scalability through aggregation. In addition, IPv6 was pursued
at the time (early 1990s) with much energy, based on the belief that a much
larger number of addresses would be required soon. Unforeseen at the time, the
widespread use of NAT (see Chapter 7) has since significantly delayed adoption of
IPv6 by not requiring every host attached to the Internet to have a unique address.
Instead, large networks using private address space are now commonplace. Ultimately,
however, the number of available routable IP addresses will eventually
dwindle to zero, so some change will be required. In February 2011 the last five /8
IPv4 address prefixes were allocated from the IANA, one to each of the five RIRs.
On April 15, 2011, APNIC exhausted all of its allocatable prefixes. The remaining
prefixes held by various RIRs are expected to remain unallocated for only a
few years at most. A current snapshot of IPv4 address utilization can be found at
[IP4R].


Link Layer


3.1 Introduction 

In Chapter 1, we saw that the purpose of the link layer in the TCP/IP protocol suite
is to send and receive IP datagrams for the IP module. It is also used to carry a
few other protocols that help support IP, such as ARP (see Chapter 4). TCP/IP supports
many different link layers, depending on the type of networking hardware
being used: wired LANs such as Ethernet, metropolitan area networks (MANs) such
as cable TV and DSL connections available through service providers, and wired
voice networks such as telephone lines with modems, as well as the more recent
wireless networks such as Wi-Fi (wireless LAN) and various wireless data services
based on cellular technology such as HSPA, EV-DO, LTE, and WiMAX. In
this chapter we shall look at some of the details involved in using the Ethernet and
Wi-Fi link layers, how the Point-to-Point Protocol (PPP) is used, and how link-layer
protocols can be carried inside other (link- or higher-layer) protocols, a technique
known as tunneling. Covering the details of every link technology available today
would require a separate text, so we instead focus on some of the most commonly
used link-layer protocols and how they are used by TCP/IP.

Most link-layer technologies have an associated protocol format that describes
how the corresponding PDUs must be constructed in order to be carried by the
network hardware. When referring to link-layer PDUs, we usually use the term
frame, so as to distinguish the PDU format from those at higher layers such as
packets or segments, terms used to describe network- and transport-layer PDUs,
respectively. Frame formats usually support a variable-length frame size ranging
from a few bytes to a few kilobytes. The upper bound of the range is called the
maximum transmission unit (MTU), a characteristic of the link layer that we shall
encounter numerous times in the remaining chapters. Some network technologies,
such as modems and serial lines, do not impose their own maximum frame
size, so they can be configured by the user.


3.2 Ethernet and the IEEE 802 LAN/MAN Standards

The term Ethernet generally refers to a set of standards first published in 1980 and 
revised in 1982 by Digital Equipment Corp., Intel Corp., and Xerox Corp. The first 
common form of Ethernet is now sometimes called "10Mb/s Ethernet" or "shared
Ethernet," and it was adopted (with minor changes) by the IEEE as standard number
802.3. Such networks were usually arranged like the network shown in Figure 3-1.



Figure 3-1 A basic shared Ethernet network consists of one or more stations (e.g., workstations, 
supercomputers) attached to a shared cable segment. Link-layer PDUs (frames) can be 
sent from one station to one or more others when the medium is determined to be free. 
If multiple stations send at the same time, possibly because of signal propagation delays, 
a collision occurs. Collisions can be detected, and they cause sending stations to wait a 
random amount of time before retrying. This common scheme is called carrier sense, 
multiple access with collision detection. 


Because multiple stations share the same network, this standard includes a
distributed algorithm implemented in each Ethernet network interface that controls
when a station gets to send data it has. The particular method, known as
carrier sense, multiple access with collision detection (CSMA/CD), mediates which
computers can access the shared medium (cable) without any other special agreement
or synchronization. This relative simplicity helped to promote the low cost
and resulting popularity of Ethernet technology.

With CSMA/CD, a station (e.g., computer) first looks for a signal currently
being sent on the network and sends its own frame when the network is free.
This is the "carrier sense" portion of the protocol. If some other station happens
to send at the same time, the resulting overlapping electrical signal is detected as
a collision. In this case, each station waits a random amount of time before trying
again. The amount of time is selected by drawing from a uniform probability
distribution that doubles in length each time a subsequent collision is detected.
Eventually, each station gets its chance to send or times out trying after some
number of attempts (16 in the case of conventional Ethernet). With CSMA/CD,
only one frame is traveling on the network at any given time. Access methods such
as CSMA/CD are more formally called Media Access Control (MAC) protocols.
There are many types of MAC protocols; some are based on having each station
try to use the network independently (contention-based protocols like CSMA/CD),
and others are based on prearranged coordination (e.g., by allocating time
slots for each station to send).
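The random wait described above can be sketched in a few lines. The following is a minimal illustration of a truncated binary exponential backoff, not a model of real interface hardware. The limit of 16 attempts comes from the text; the cap on window growth after 10 collisions and the 51.2-microsecond slot time are details of conventional 10Mb/s Ethernet that the passage does not state, and the function names are hypothetical.

```python
import random

SLOT_TIME_US = 51.2   # assumed slot time for 10Mb/s Ethernet (512 bit times)
MAX_ATTEMPTS = 16     # attempt limit mentioned in the text
BACKOFF_LIMIT = 10    # assumed: the window stops doubling after 10 collisions

def backoff_window(collisions: int) -> int:
    """Number of equally likely slot choices after the given collision count."""
    return 2 ** min(collisions, BACKOFF_LIMIT)

def backoff_slots(collisions: int, rng=random) -> int:
    """Pick how many slot times to wait after the nth collision on a frame.

    Raises RuntimeError once the attempt limit is reached, which models the
    station giving up on the frame.
    """
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= MAX_ATTEMPTS:
        raise RuntimeError("too many collisions; frame is abandoned")
    return rng.randrange(backoff_window(collisions))

if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so repeated runs behave the same
    for n in (1, 2, 3, 10, 15):
        slots = backoff_slots(n, rng)
        print(f"collision {n:2d}: window {backoff_window(n):4d} slots, "
              f"wait {slots * SLOT_TIME_US:.1f} us")
```

Because the window doubles with each collision, stations that keep colliding spread their retries over a rapidly growing interval, which makes a further collision progressively less likely.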

Since the development of 10Mb/s Ethernet, faster computers and infrastructure
have driven the need for ever-increasing speeds in LANs. Given the popularity
of Ethernet, significant innovation and effort have managed to increase its
speed from 10Mb/s to 100Mb/s to 1000Mb/s to 10Gb/s, and now to even more.
The 10Gb/s form is becoming popular in larger data centers and large enterprises,
and speeds as high as 100Gb/s have been demonstrated. The very first (research)
Ethernet ran at 3Mb/s, but the DIX (Digital, Intel, Xerox) standard ran at 10Mb/s
over a shared physical cable or set of cable segments interconnected by electrical
repeaters. By the early 1990s, the shared cable had largely been replaced by
twisted-pair wiring (resembling telephone wires and often called "10BASE-T").
With the development of 100Mb/s (also called "fast Ethernet," the most popular
version of which is known as "100BASE-TX"), contention-based MAC protocols
have become less popular. Instead, the wiring between each LAN station is often
not shared but instead provides a dedicated electrical path in a star topology. This
can be accomplished with Ethernet switches, as shown in Figure 3-2.



Figure 3-2 A switched Ethernet network consists of one or more stations, each of which is attached 
to a switch port using a dedicated wiring path. In most cases where switched Ethernet is 
used, the network operates in a full-duplex fashion and the CSMA/CD algorithm is not 
required. Switches may be cascaded to form larger Ethernet LANs by interconnecting 
switch ports, sometimes called "uplink" ports.

At present, switches are commonly used, providing each Ethernet station with
the ability to send and receive data simultaneously (called "full-duplex Ethernet").
Although half-duplex (one direction at a time) operation is still supported even by
1000Mb/s Ethernet (1000BASE-T), it is rarely used relative to full-duplex Ethernet.
We shall discuss how switches process PDUs in more detail later.

One of the most popular technologies used to access the Internet today is
wireless networking, the most common for wireless local area networks (WLANs)
being an IEEE standard known as Wireless Fidelity or Wi-Fi, and sometimes
called "wireless Ethernet" or 802.11. Although this standard is distinct from the
802 wired Ethernet standards, the frame format and general interface are largely
borrowed from 802.3, and all are part of the set of IEEE 802 LAN standards. Thus,
most of the capabilities used by TCP/IP for Ethernet networks are also used for
Wi-Fi networks. We shall explore each of these in more detail. First, however, it
is useful to get a bigger picture of all of the IEEE 802 standards that are relevant
for setting up home and enterprise networks. We also include references to those
IEEE standards governing MAN standards, including IEEE 802.16 (WiMAX) and
the standard for media-independent handoffs in cellular networks (IEEE 802.21).

3.2.1 The IEEE 802 LAN/MAN Standards 

The original Ethernet frame format and operation were described by industry
agreement, mentioned earlier. This format was known as the DIX format or Ethernet II
format. This type of Ethernet network, with slight modification, was later
standardized by the IEEE as a form of CSMA/CD network, called 802.3. In the
world of IEEE standards, standards with the prefix 802 define the operations of
LANs and MANs. The most popular 802 standards today include 802.3 (essentially
Ethernet) and 802.11 (WLAN/Wi-Fi). These standards have evolved over
time and have changed names as freestanding amendments (e.g., 802.11g) are
ultimately incorporated in revised standards. Table 3-1 shows a fairly complete
list of the IEEE 802 LAN and MAN standards relevant to supporting the TCP/IP
protocols, as of mid-2011.


Table 3-1 LAN and MAN IEEE 802 standards relevant to the TCP/IP protocols (2011)

Name             Description                                       Official Reference
802.1ak          Multiple Registration Protocol (MRP)              [802.1AK-2007]
802.1AE          MAC Security (MACSec)                             [802.1AE-2006]
802.1AX          Link Aggregation (formerly 802.3ad)               [802.1AX-2008]
802.1d           MAC Bridges                                       [802.1D-2004]
802.1p           Traffic classes/priority/QoS                      [802.1D-2004]
802.1q           Virtual Bridged LANs/Corrections to MRP           [802.1Q-2005/Cor1-2008]
802.1s           Multiple Spanning Tree Protocol (MSTP)            [802.1Q-2005]
802.1w           Rapid Spanning Tree Protocol (RSTP)               [802.1D-2004]
802.1X           Port-Based Network Access Control (PNAC)          [802.1X-2010]
802.2            Logical Link Control (LLC)                        [802.2-1998]
802.3            Baseline Ethernet and 10Mb/s Ethernet             [802.3-2008] (Section One)
802.3u           100Mb/s Ethernet ("Fast Ethernet")                [802.3-2008] (Section Two)
802.3x           Full-duplex operation and flow control            [802.3-2008]
802.3z/802.3ab   1000Mb/s Ethernet ("Gigabit Ethernet")            [802.3-2008] (Section Three)
802.3ae          10Gb/s Ethernet ("Ten-Gigabit Ethernet")          [802.3-2008] (Section Four)
802.3ad          Link Aggregation                                  [802.1AX-2008]
802.3af          Power over Ethernet (PoE) (to 15.4W)              [802.3-2008] (Section Two)
802.3ah          Access Ethernet ("Ethernet in the First Mile      [802.3-2008] (Section Five)
                 (EFM)")
802.3as          Frame format extensions (to 2000 bytes)           [802.3-2008]
802.3at          Power over Ethernet enhancements ("PoE+" to 30W)  [802.3at-2009]
802.3ba          40/100Gb/s Ethernet                               [802.3ba-2010]
802.11a          54Mb/s Wireless LAN at 5GHz                       [802.11-2007]
802.11b          11Mb/s Wireless LAN at 2.4GHz                     [802.11-2007]
802.11e          QoS enhancement for 802.11                        [802.11-2007]
802.11g          54Mb/s Wireless LAN at 2.4GHz                     [802.11-2007]
802.11h          Spectrum/power management extensions              [802.11-2007]
802.11i          Security enhancements/replaces WEP                [802.11-2007]
802.11j          4.9-5.0GHz operation in Japan                     [802.11-2007]
802.11n          6.5-600Mb/s Wireless LAN at 2.4 and 5GHz          [802.11n-2009]
                 using optional MIMO and 40MHz channels
802.11s (draft)  Mesh networking, congestion control               Under development
802.11y          54Mb/s wireless LAN at 3.7GHz (licensed)          [802.11y-2008]
802.16           Broadband Wireless Access Systems (WiMAX)         [802.16-2009]
802.16d          Fixed Wireless MAN Standard (WiMAX)               [802.16-2009]
802.16e          Fixed/Mobile Wireless MAN Standard (WiMAX)        [802.16-2009]
802.16h          Improved Coexistence Mechanisms                   [802.16h-2010]
802.16j          Multihop Relays in 802.16                         [802.16j-2009]
802.16k          Bridging of 802.16                                [802.16k-2007]
802.21           Media Independent Handovers                       [802.21-2008]


Other than the specific types of LAN networks defined by the 802.3, 802.11,
and 802.16 standards, there are some related standards that apply across all of
the IEEE standard LAN technologies. Common to all three of these is the 802.2
standard that defines the Logical Link Control (LLC) frame header common among
many of the 802 networks' frame formats. In IEEE terminology, LLC and MAC
are "sublayers" of the link layer, where the LLC (mostly frame format) is generally
common to each type of network and the MAC layer may be somewhat different.
While the original Ethernet made use of CSMA/CD, for example, WLANs often
make use of CSMA/CA (CA is "collision avoidance").


Note 

Unfortunately, the combination of 802.2 and 802.3 defined a different frame format
from Ethernet II until 802.3x finally rectified the situation. It has been incorporated
into [802.3-2008]. In the TCP/IP world, the encapsulation of IP datagrams
is defined in [RFC0894] and [RFC2464] for Ethernet networks, although the older
LLC/SNAP encapsulation remains published as [RFC1042]. While this is no longer
much of an issue, it was once a source of concern, and similar issues occasionally
arise [RFC4840].


The frame format has remained essentially the same until fairly recently. To
get an understanding of the details of the format and how it has evolved, we now
turn our focus to these details.

3.2.2 The Ethernet Frame Format 

All Ethernet (802.3) frames are based on a common format. Since its original specification,
the frame format has evolved to support additional functions. Figure 3-3
shows the current layout of an Ethernet frame and how it relates to a relatively new
term introduced by IEEE, the IEEE packet (a somewhat unfortunate term given its
uses in other standards).

The Ethernet frame begins with a Preamble area used by the receiving interface's
circuitry to determine when a frame is arriving and to determine the amount
of time between encoded bits (called clock recovery). Because Ethernet is an asynchronous
LAN (i.e., precisely synchronized clocks are not maintained in each Ethernet
interface card), the space between encoded bits may differ somewhat from
one interface card to the next. The preamble is a recognizable pattern (0xAA typically),
which the receiver can use to "recover the clock" by the time the start frame
delimiter (SFD) is found. The SFD has the fixed value 0xAB.


Note 

The original Ethernet encoded bits using a Manchester Phase Encoding (MPE)
with two voltage levels. With MPE, bits are encoded as voltage transitions rather
than absolute values. For example, the bit 0 is encoded as a transition from -0.85
to +0.85V, and a 1 bit is encoded as a +0.85 to -0.85V transition (0V indicates
that the shared wire is idle). The 10Mb/s Ethernet specification required network
hardware to use an oscillator running at 20MHz, because MPE requires two clock
cycles per bit. The bytes 0xAA (10101010 in binary) present in the Ethernet
preamble would be a square wave between +0.85 and -0.85V with a frequency of
5MHz, because alternating bits produce only one transition per 100ns bit time, so
the waveform repeats every two bit times. Manchester encoding was replaced with
different encodings in other Ethernet standards to improve efficiency.


This basic frame format includes 48-bit (6-byte) Destination (DST) and Source
(SRC) Address fields. These addresses are sometimes known by other names such
as "MAC address," "link-layer address," "802 address," "hardware address," or
"physical address." The destination address in an Ethernet frame is also allowed
to address more than one station (called "broadcast" or "multicast"; see
Chapter 9). The broadcast capability is used by the ARP protocol (see Chapter 4) and
multicast capability is used by the ICMPv6 protocol (see Chapter 8) to convert
between network-layer and link-layer addresses.
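
To make the distinction between unicast, multicast, and broadcast destinations concrete, the following minimal sketch classifies a destination address. It relies on two facts not spelled out above: the broadcast address is all 1 bits (ff:ff:ff:ff:ff:ff), and group (multicast) addresses have the low-order bit of the first byte set. The function names are our own and purely illustrative.

```python
def parse_mac(text):
    """Convert a colon-separated MAC address string into 6 raw bytes."""
    raw = bytes.fromhex(text.replace(":", ""))
    if len(raw) != 6:
        raise ValueError("a MAC address must be exactly 6 bytes")
    return raw


def classify_destination(mac):
    """Return 'broadcast', 'multicast', or 'unicast' for a 6-byte address."""
    if mac == b"\xff" * 6:
        return "broadcast"      # all stations on the LAN
    if mac[0] & 0x01:
        return "multicast"      # group bit set in the first byte
    return "unicast"            # a single station


if __name__ == "__main__":
    for addr in ("ff:ff:ff:ff:ff:ff", "33:33:00:00:00:01", "00:04:5a:9f:9e:80"):
        print(addr, classify_destination(parse_mac(addr)))
```

The second address in the usage example is of the form commonly seen for IPv6 multicast; the third is the source address that appears in the captures later in this section.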

Following the source address is a Type field that doubles as a Length field.
Ordinarily, it identifies the type of data that follows the header. Popular values used
with TCP/IP networks include IPv4 (0x0800), IPv6 (0x86DD), and ARP (0x0806).
The value 0x8100 indicates a Q-tagged frame (i.e., one that can carry a "virtual
LAN" or VLAN ID according to the 802.1q standard). The maximum size of a basic
Ethernet frame is 1518 bytes, but the more recent standard extended this size to
2000 bytes.
[Figure 3-3 diagram. Frame size ranges shown: basic frames, 64-1518 bytes;
Q-tagged frames, 64-1522 bytes; envelope frames, 64-2000 bytes. Up to 482 bytes
of tags are allowed in envelope frames (Q-tagged frames are envelope frames).]


Figure 3-3 The Ethernet (IEEE 802.3) frame format contains source and destination addresses, an overloaded
Length/Type field, a field for data, and a frame check sequence (a CRC32). Additions to the basic
frame format provide for a tag containing a VLAN ID and priority information (802.1p/q) and
more recently for an extensible number of tags. The preamble and SFD are used for
synchronizing receivers. When half-duplex operation is used with Ethernet running at 100Mb/s or more,
additional bits may be appended to short frames as a carrier extension to ensure that the collision
detection circuitry operates properly.
Note

The original IEEE (802.3) specification treats the Length/Type field as a Length
field instead of a Type field. The field is thereby overloaded (used for more than
one purpose). The trick is to look at the value of the field. Today, if the value in the
field is greater than or equal to 1536 (0x0600), the field must contain a type value;
type values are assigned by standards to be 1536 or greater. If the value of the field
is 1500 or less, the field indicates the length. The full list of types is given by
[ETHERTYPES].
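
The rule for interpreting the overloaded Length/Type field can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the small lookup table are ours (the table holds just the type values mentioned in this section), and the header parser assumes an untagged frame with the preamble and SFD already stripped.

```python
import struct

# Only the type values mentioned in the text; the full registry is far larger.
KNOWN_TYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x86DD: "IPv6", 0x8100: "802.1Q tag"}


def classify_length_type(value):
    """Interpret the 16-bit Length/Type field of an Ethernet header."""
    if value >= 1536:                       # 0x0600 and above: a type value
        return ("type", KNOWN_TYPES.get(value, "unknown"))
    if value <= 1500:                       # a payload length in bytes
        return ("length", value)
    return ("undefined", value)             # 1501-1535 is neither


def parse_header(frame):
    """Split the 14-byte header (DST, SRC, Length/Type) of an untagged frame."""
    dst, src, length_type = struct.unpack("!6s6sH", frame[:14])
    return dst, src, classify_length_type(length_type)


if __name__ == "__main__":
    header = bytes.fromhex("ffffffffffff" "00045a9f9e80" "0806")
    print(parse_header(header + bytes(46)))
```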


Following the Destination and Source Address fields, [802.3-2008] provides for
a variable number of tags that contain various protocol fields defined by other
IEEE standards. The most common of these are the tags used by 802.1p and 802.1q,
which provide for virtual LANs and some quality-of-service (QoS) indicators. These
are discussed in Section 3.2.3.


Note 

The current [802.3-2008] standard incorporates the frame format modifications
of 802.3as, which provides for a maximum of 482 bytes for holding "tags" to be
carried with each Ethernet frame. These larger frames, called envelope frames, may
be up to 2000 bytes in length. Frames containing 802.1p/q tags, called Q-tagged
frames, are also envelope frames. However, not all envelope frames are
necessarily Q-tagged frames.


Following the fields discussed so far is the data area or payload portion of the
frame. This is the area where higher-layer PDUs such as IP datagrams are placed.
Traditionally, the payload area for Ethernet has always been 1500 bytes,
representing the MTU for Ethernet. Most systems today use the 1500-byte MTU size for
Ethernet, although it is generally possible to configure a smaller value if this is
desired. The payload sometimes is padded (appended) with 0 bytes to ensure that
the overall frame meets the minimum length requirements we discuss in Section
3.2.2.2.

3.2.2.1 Frame Check Sequence/Cyclic Redundancy Check (CRC) 

The final field of the Ethernet frame format follows the payload area and provides
an integrity check on the frame. The Cyclic Redundancy Check (CRC) field at the
end includes 32 bits and is sometimes known as the IEEE/ANSI standard CRC32
[802.3-2008]. To use an n-bit CRC to detect errors in transmitted data, the
message to be checked is first appended with n 0 bits, forming the augmented
message. Then, the augmented message is divided (using modulo-2 division) by an
(n + 1)-bit value called the generator polynomial, which acts as the divisor. The value
placed in the CRC field of the message is the one's complement of the remainder of
this division (the quotient is discarded). Generator polynomials are standardized
for a number of different values of n. For Ethernet, which uses n = 32, the CRC32
generator polynomial is the 33-bit binary number
100000100110000010001110110110111. To get a feeling for how the remainder is
computed using long (mod-2) binary division, we can examine a simpler case using
CRC4. The ITU has standardized the value 10011 for the CRC4 generator polynomial
in a standard called G.704 [G704]. If we wish to send the 16-bit message
1001111000101111, we first begin with the long (mod-2) binary division shown in
Figure 3-4.


            1000011010010100      Quotient (Discarded)
10011 | 10011110001011110000      Message
        10011
         00001
         00000
          00011
          00000
           00110
           00000
            01100
            00000
             11000
             10011
              10111
              10011
               01000
               00000
                10001
                10011
                 00101
                 00000
                  01011
                  00000
                   10111
                   10011
                    01000
                    00000
                     10000
                     10011
                      00110
                      00000
                       01100
                       00000
                        1100      Remainder


Figure 3-4 Long (mod-2) binary division demonstrating the computation of a CRC4

In this figure, we see that the remainder after division is the 4-bit value 1100.
Ordinarily, the one's complement of this value (0011) would be placed in a CRC or
Frame Check Sequence (FCS) field in the frame. Upon receipt, the receiver performs
the same division and checks whether the value in the FCS field matches the value
it computes. If the two do not match, the frame was likely damaged in transit
and is usually discarded. The CRC family of functions can be used to provide a
strong indicator of corrupted messages because any change in the bit pattern is
highly likely to cause a change in the remainder term.
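
The long-division procedure can be checked mechanically. The following is a minimal sketch, not a production CRC routine: it works on strings of '0' and '1' characters for readability, and the function names are ours. It appends n 0 bits, performs the mod-2 (XOR) division, and returns the n-bit remainder; a second helper forms the one's complement that would be placed in the FCS field.

```python
def mod2_remainder(message_bits, generator_bits):
    """Remainder of the augmented message divided (mod 2) by the generator."""
    n = len(generator_bits) - 1                 # width of the CRC in bits
    work = [int(b) for b in message_bits] + [0] * n   # augmented message
    gen = [int(b) for b in generator_bits]
    for i in range(len(message_bits)):
        if work[i]:                             # leading bit is 1: subtract (XOR)
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return "".join(str(b) for b in work[-n:])   # last n bits are the remainder


def ones_complement(bits):
    """Invert every bit of a bit string."""
    return "".join("1" if b == "0" else "0" for b in bits)


if __name__ == "__main__":
    remainder = mod2_remainder("1001111000101111", "10011")
    print("remainder:", remainder, "FCS value:", ones_complement(remainder))
```

Running the sketch on the 16-bit message and CRC4 generator used in the text lets the reader check the figure step by step. Note that library routines such as Python's zlib.crc32 use the same CRC32 generator polynomial but also apply bit reflection and a nonzero initial value, so they are not simply this function called with the 33-bit generator.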

3.2.2.2 Frame Sizes 

There is both a minimum and a maximum size of Ethernet frames. The minimum
is 64 bytes, requiring a minimum data area (payload) length of 46 bytes (no tags)
once the 14-byte header and 4-byte CRC are accounted for.
In cases where the payload is smaller, pad bytes (value 0) are appended to the end
of the payload portion to ensure that the minimum length is enforced.


Note 

The minimum was important for the original 10Mb/s Ethernet using CSMA/CD.
In order for a transmitting station to know which frame encountered a collision, a
limit of 2500m (five 500m cable segments with four repeaters) was placed upon
the length of an Ethernet network. Given that the propagation rate for signals
in copper is about 0.77c or 231M m/s, and given the transmission time of 64 bytes
to be (64 * 8/10,000,000) = 51.2μs at 10Mb/s, a minimum-size frame could
consume about 11,000m of cable. With a maximum of 2500m of cable, the maximum
round-trip distance from one station to another is 5000m. The designers of
Ethernet included a factor of 2 overdesign in fixing the minimum frame size, so in all
compliant cases (and many noncompliant cases), the last bit of an outgoing frame
would still be in the process of being transmitted after the time required for its
signal to arrive at a maximally distant receiver and return. If a collision is detected,
the transmitting station thus knows with certainty which frame collided: the one
it is currently transmitting. In this case, the station sends a jamming signal (high
voltage) to alert other stations, which then initiate a random binary exponential
backoff procedure.
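
The arithmetic in this note is easy to reproduce. The short sketch below uses only the figures quoted above (64-byte minimum frame, 10Mb/s, 231M m/s propagation, 2500m maximum network length); the function and constant names are ours.

```python
BIT_RATE = 10_000_000          # bits per second (10Mb/s Ethernet)
PROPAGATION = 231_000_000      # meters per second, about 0.77c (from the text)
MIN_FRAME_BYTES = 64
MAX_NETWORK_M = 2500


def min_frame_budget():
    """Return (transmission time in s, cable length occupied in m, margin)."""
    tx_time = MIN_FRAME_BYTES * 8 / BIT_RATE       # time to send 64 bytes
    frame_length_m = tx_time * PROPAGATION         # cable "consumed" by the frame
    round_trip_m = 2 * MAX_NETWORK_M               # worst-case out-and-back path
    return tx_time, frame_length_m, frame_length_m / round_trip_m


if __name__ == "__main__":
    tx_time, length_m, margin = min_frame_budget()
    print(f"{tx_time * 1e6:.1f} microseconds, {length_m:.0f} m, margin {margin:.2f}x")
```

The margin value is the ratio of the cable length a minimum-size frame occupies to the 5000m worst-case round trip, which is where the factor-of-2 overdesign shows up.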


The maximum frame size of conventional Ethernet is 1518 bytes (including
the 4-byte CRC and 14-byte header). This value represents a sort of trade-off: if
a frame contains an error (detected on receipt by an incorrect CRC), only 1.5KB
need to be retransmitted to repair the problem. On the other hand, the size limits
the MTU to not more than 1500 bytes. In order to send a larger message, multiple
frames are required (e.g., 64KB, a common larger size used with TCP/IP networks,
would require at least 44 frames).

The unfortunate consequence of requiring multiple Ethernet frames to hold a
larger upper-layer PDU is that each frame contributes a fixed overhead (14 bytes
header, 4 bytes CRC). To make matters worse, Ethernet frames cannot be squished
together on the network without any space between them, in order to allow the
Ethernet hardware receiver circuits to properly recover data from the network and
to provide the opportunity for other stations to interleave their traffic with the
existing Ethernet traffic. The Ethernet II specification, in addition to specifying a
7-byte preamble and 1-byte SFD that precedes any Ethernet frame, also specifies
an inter-packet gap (IPG) of 12 byte times (9.6μs at 10Mb/s, 960ns at 100Mb/s, 96ns
at 1000Mb/s, and 9.6ns at 10,000Mb/s). Thus, the per-frame efficiency for Ethernet
II is at most 1500/(12 + 8 + 14 + 1500 + 4) = 0.975293, or about 98%. One way to
improve efficiency when moving large amounts of data across an Ethernet would
be to make the frame size larger. This has been accomplished using Ethernet jumbo
frames [JF], a nonstandard extension to Ethernet (in 1000Mb/s Ethernet switches
primarily) that typically allows the frame size to be as large as 9000 bytes. Some
environments make use of so-called super jumbo frames, which are usually
understood to carry more than 9000 bytes. Care should be taken when using jumbo
frames, as these larger frames are not interoperable with the smaller 1518-byte
frame size used by most legacy Ethernet equipment.
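
The efficiency figure and the frame count quoted above can be reproduced with a short calculation. This sketch uses the overhead values given in the text (12-byte IPG, 8 bytes of preamble and SFD, 14-byte header, 4-byte CRC); the function names are ours, and treating 9000 bytes as the jumbo payload size is an assumption made only for comparison.

```python
import math

IPG = 12            # inter-packet gap, in byte times
PREAMBLE_SFD = 8    # 7-byte preamble plus 1-byte SFD
HEADER = 14         # DST + SRC + Length/Type
FCS = 4             # CRC32


def efficiency(payload_bytes):
    """Fraction of wire time spent carrying payload for one frame."""
    return payload_bytes / (IPG + PREAMBLE_SFD + HEADER + payload_bytes + FCS)


def frames_needed(message_bytes, mtu=1500):
    """Number of frames required to carry a message of the given size."""
    return math.ceil(message_bytes / mtu)


if __name__ == "__main__":
    print(f"1500-byte payload: {efficiency(1500):.6f}")
    print(f"9000-byte payload: {efficiency(9000):.6f}")
    print("frames for 64KB:", frames_needed(64 * 1024))
```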

3.2.3 802.1p/q: Virtual LANs and QoS Tagging 

With the growing use of switched Ethernet, it has become possible to interconnect
every computer at a site on the same Ethernet LAN. The advantage of doing this
is that any host can directly communicate with any other host, using IP and other
network-layer protocols, and requiring little or no administrator configuration. In
addition, broadcast and multicast traffic (see Chapter 9) is distributed to all hosts
that may wish to receive it without having to set up special multicast routing
protocols. While these represent some of the advantages of placing many stations on the
same Ethernet, having broadcast traffic go to every computer can create an
undesirable amount of network traffic when many hosts use broadcast, and there may
be some security reasons to disallow complete any-to-any station communication.

To address some of these problems with running large, multiuse switched
networks, IEEE extended the 802 LAN standards with a capability called virtual
LANs (VLANs) in a standard known as 802.1q [802.1Q-2005]. Compliant Ethernet
switches isolate traffic among hosts to common VLANs. Note that because of this
isolation, two hosts attached to the same switch but operating on different VLANs
require a router between them for traffic to flow. Combination switch/router
devices have been created to address this need, and ultimately the performance of
routers has been improved to match the performance of VLAN switching. Thus,
the appeal of VLANs has diminished somewhat, in favor of modern
high-performance routers. Nonetheless, they are still used, remain popular in some
environments, and are important to understand.

Several methods are used to specify the station-to-VLAN mapping.
Assigning VLANs by port is a simple and common method, whereby the switch port
to which the station is attached is assigned a particular VLAN, so any station so
attached becomes a member of the associated VLAN. Other options include
MAC-address-based VLANs that use tables within Ethernet switches to map a station's
MAC address to a corresponding VLAN. This can become difficult to manage if
stations change their MAC addresses (which they do sometimes, thanks to the
behavior of some users). IP addresses can also be used as a basis for assigning
VLANs.

When stations in different VLANs are attached to the same switch, the switch
ensures that traffic does not leak from one VLAN to another, irrespective of the
types of Ethernet interfaces being used by the stations. When multiple VLANs
must span multiple switches (trunking), it becomes necessary to label Ethernet
frames with the VLAN to which they belong before they are sent to another
switch. Support for this capability uses a tag called the VLAN tag (or header),
which holds 12 bits of VLAN identifier (providing for 4096 VLANs, although VLAN
0 and VLAN 4095 are reserved). It also contains 3 bits of priority for supporting
QoS, defined in the 802.1p standard, as indicated in Figure 3-3. In many cases, the
administrator must configure the ports of the switch to be used to send 802.1p/q
frames by enabling trunking on the appropriate ports. To make this job somewhat
easier, some switches support a native VLAN option on trunked ports, meaning
that untagged frames are by default associated with the native VLAN. Trunking
ports are used to interconnect VLAN-capable switches, and other ports are
typically used to attach stations. Some switches also support proprietary methods for
VLAN trunking (e.g., the Cisco Inter-Switch Link (ISL) protocol).

802.1p specifies a mechanism to express a QoS identifier on each frame. The
802.1p header includes a 3-bit-wide Priority field indicating a QoS level. This
standard is an extension of the 802.1q VLAN standard. The two standards work
together and share bits in the same header. With the 3 available bits, eight classes
of service are defined. Class 0, the lowest priority, is for conventional, best-effort
traffic. Class 7 is the highest priority and might be used for critical routing or
network management functions. The standards specify how priorities are encoded in
packets but leave the policy that governs which packets should receive which class,
and the underlying mechanisms implementing prioritized services, to be defined
by the implementer. Thus, the way traffic of one priority class is handled relative to
another is implementation- or vendor-defined. Note that 802.1p can be used
independently of VLANs if the VLAN ID field in the 802.1p/q header is set to 0.
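
To see how the 3-bit priority and 12-bit VLAN ID share one header, the following sketch packs and unpacks a 4-byte 802.1p/q tag. It assumes the usual layout of the 16-bit field that follows the 0x8100 type value: priority in the top 3 bits, a single flag bit next (not discussed in the text and left at 0 here), and the VLAN ID in the low 12 bits. The function names are ours.

```python
import struct

TPID_8021Q = 0x8100     # type value identifying a Q-tagged frame


def pack_vlan_tag(vid, priority=0, flag=0):
    """Build the 4-byte tag: 0x8100 followed by priority, flag bit, VLAN ID."""
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= priority <= 7:
        raise ValueError("priority must fit in 3 bits")
    tci = (priority << 13) | ((flag & 1) << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)


def unpack_vlan_tag(tag):
    """Split a 4-byte tag back into its priority, flag, and VLAN ID fields."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return {"priority": tci >> 13, "flag": (tci >> 12) & 1, "vid": tci & 0xFFF}


if __name__ == "__main__":
    tag = pack_vlan_tag(vid=2, priority=0)
    print(tag.hex(), unpack_vlan_tag(tag))
    # A priority-only tag (802.1p without a VLAN) carries VLAN ID 0.
    print(pack_vlan_tag(vid=0, priority=7).hex())
```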

The Linux command for manipulating 802.1p/q information is called vconfig.
It can be used to add and remove virtual interfaces associating VLAN IDs to
physical interfaces. It can also be used to set 802.1p priorities, change the way
virtual interfaces are identified, and influence the mapping between packets tagged
with certain VLAN IDs and how they are prioritized during protocol processing
in the operating system. The following commands add a virtual interface to
interface eth1 with VLAN ID 2, remove it, change the way such virtual interfaces are
named, and add a new interface:

Linux# vconfig add eth1 2
Added VLAN with VID == 2 to IF -:eth1:-
Linux# ifconfig eth1.2
eth1.2    Link encap:Ethernet  HWaddr 00:04:5A:9F:9E:80
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Linux# vconfig rem eth1.2
Removed VLAN -:eth1.2:-
Linux# vconfig set_name_type VLAN_PLUS_VID
Set name-type for VLAN subsystem. Should be visible in
/proc/net/vlan/config
Linux# vconfig add eth1 2
Added VLAN with VID == 2 to IF -:eth1:-
Linux# ifconfig vlan0002
vlan0002  Link encap:Ethernet  HWaddr 00:04:5A:9F:9E:80
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Here we can see that the default method of naming virtual interfaces in Linux
is based on concatenating the associated physical interface with the VLAN ID. For
example, VLAN ID 2 associated with the interface eth1 is called eth1.2. This
example also shows how an alternative naming method can be used, whereby the
VLANs are enumerated by the names vlan<n> where <n> is the identifier of the
VLAN. Once this is set up, frames sent on the VLAN device are tagged with the
VLAN ID, as expected. We can see this using Wireshark, as shown in Figure 3-5.



Figure 3-5 Frames tagged with the VLAN ID as shown in Wireshark. The default columns and
settings have been changed to display the VLAN ID and raw Ethernet addresses.

This figure shows an ARP packet (see Chapter 4) carried on VLAN 2. We can
see that the frame size is 60 bytes (not including CRC). The frame is encapsulated
using the Ethernet II encapsulation with type 0x8100, indicating a VLAN. Other
than the VLAN header, which indicates that this frame belongs to VLAN 2 and
has priority 0, this frame is unremarkable. All the other fields are as we would
expect with a regular ARP packet.

3.2.4 802.1AX: Link Aggregation (Formerly 802.3ad)

Some systems equipped with multiple network interfaces are capable of bonding or 
link aggregation. With link aggregation, two or more interfaces are treated as one in 
order to achieve greater reliability through redundancy or greater performance by 
splitting (striping) data across multiple interfaces. The IEEE Amendment 802.1AX 
[802.1AX-2008] defines the most common method for performing link aggregation 
and the Link Aggregation Control Protocol (LACP) to manage such links. LACP uses 
IEEE 802 frames of a particular format (called LACPDUs). 

Using link aggregation on Ethernet switches that support it can be a
cost-effective alternative to investing in switches with high-speed network ports. If
more than one port can be aggregated to provide adequate bandwidth,
higher-speed ports may not be required. Link aggregation may be supported not only on
network switches but across multiple network interface cards (NICs) on a host
computer. Often, aggregated ports must be of the same type, operating in the same
mode (i.e., half- or full-duplex).

Linux has the capability to implement link aggregation (bonding) across
different types of devices using the following commands:


Linux# modprobe bonding
Linux# ifconfig bond0 10.0.0.111 netmask 255.255.255.128
Linux# ifenslave bond0 eth0 wlan0


This set of commands first loads the bonding driver, which is a special type
of device driver supporting link aggregation. The second command creates the
bond0 interface with the IPv4 address information provided. Although providing
the IP-related information is not critical for creating an aggregated interface, it is
typical. Once the ifenslave command executes, the bonding device, bond0, is
labeled with the MASTER flag, and the eth0 and wlan0 devices are labeled with
the SLAVE flag:


bond0     Link encap:Ethernet  HWaddr 00:11:A3:00:2C:2A
          inet addr:10.0.0.111  Bcast:10.0.0.127  Mask:255.255.255.128
          inet6 addr: fe80::211:a3ff:fe00:2c2a/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:2146 errors:0 dropped:0 overruns:0 frame:0
          TX packets:985 errors:0 dropped:0 overruns:0 carrier:0
          collisions:18 txqueuelen:0
          RX bytes:281939 (275.3 KiB)  TX bytes:141391 (138.0 KiB)

eth0      Link encap:Ethernet  HWaddr 00:11:A3:00:2C:2A
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:1882 errors:0 dropped:0 overruns:0 frame:0
          TX packets:961 errors:0 dropped:0 overruns:0 carrier:0
          collisions:18 txqueuelen:1000
          RX bytes:244231 (238.5 KiB)  TX bytes:136561 (133.3 KiB)
          Interrupt:20 Base address:0x6c00

wlan0     Link encap:Ethernet  HWaddr 00:11:A3:00:2C:2A
          UP BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:269 errors:0 dropped:0 overruns:0 frame:0
          TX packets:24 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:38579 (37.6 KiB)  TX bytes:4830 (4.7 KiB)

In this example, we have bonded together a wired Ethernet interface with
a Wi-Fi interface. The master device, bond0, is assigned the IPv4 address
information we would typically assign to either of the individual interfaces, and it
receives the first slave's MAC address by default. When IPv4 traffic is sent out of
the bond0 virtual interface, there are a number of possibilities as to which of the
slave interfaces will carry it. In Linux, the options are selected using arguments
provided when the bonding driver is loaded. For example, a mode option
determines whether round-robin delivery is used between the interfaces, one interface
acts as a backup to the other, the interface is selected based on performing an XOR
of the MAC source and destination addresses, frames are copied to all interfaces,
802.3ad standard link aggregation is performed, or more advanced load-balancing
options are used. The second mode is used for high-availability systems that can
fail over to a redundant network infrastructure if one link has ceased
functioning (detectable by MII monitoring; see [BOND] for more details). The third mode
is intended to choose the slave interface based on the traffic flow. With enough
different destinations, traffic between the two stations is pinned to one interface.
This can be useful when trying to minimize reordering while also trying to
load-balance traffic across multiple slave interfaces. The fourth mode is for fault
tolerance. The fifth mode is for use with 802.3ad-capable switches, to enable dynamic
aggregation over homogeneous links.
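
The idea behind the XOR-based mode can be illustrated with a simplified sketch. The assumption here is that the slave is chosen as (source MAC XOR destination MAC) modulo the number of slaves; the real bonding driver's hash differs in detail and offers other policies (see [BOND]), so the function below is illustrative only and its names are ours.

```python
def mac_to_int(mac):
    """Convert a colon-separated MAC address string to an integer."""
    return int(mac.replace(":", ""), 16)


def xor_select(src_mac, dst_mac, slaves):
    """Pick a slave interface from the XOR of the two addresses (simplified)."""
    index = (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % len(slaves)
    return slaves[index]


if __name__ == "__main__":
    slaves = ["eth0", "wlan0"]
    src = "00:11:a3:00:2c:2a"
    for dst in ("00:07:e9:14:a9:c1", "00:08:74:93:c8:3c"):
        # The same pair of stations always maps to the same slave, which
        # avoids reordering; different destinations may map to different slaves.
        print(dst, "->", xor_select(src, dst, slaves))
```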

The LACP protocol is designed to make the job of setting up link aggregation
simpler by avoiding manual configuration. Typically the LACP "actor" (client) and
"partner" (server) send LACPDUs every second once enabled. LACP
automatically determines which member links can be aggregated into a link aggregation
group (LAG) and aggregates them. This is accomplished by sending a collection of
information (MAC address, port priority, port number, and key) across the link. A
receiving station can compare the values it sees from other ports and perform the
aggregation if they match. Details of LACP are covered in [802.1AX-2008].

3.3 Full Duplex, Power Save, Autonegotiation, and 802.1X Flow Control

When Ethernet was first developed, it operated only in half-duplex mode using
a shared cable. That is, data could be sent only one way at one time, so only one
station was sending a frame at any given point in time. With the development of
switched Ethernet, the network was no longer a single piece of shared wire, but
instead many sets of links. As a result, multiple pairs of stations could exchange
data simultaneously. In addition, Ethernet was modified to operate in full duplex,
effectively disabling the collision detection circuitry. This also allowed the
physical length of the Ethernet to be extended, because the timing constraints
associated with half-duplex operation and collision detection were removed.

In Linux, the ethtool program can be used to query whether full duplex is
supported and whether it is being used. This tool can also display and set many
other interesting properties of an Ethernet interface:


Linux# ethtool eth0
Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 10Mb/s
        Duplex: Half
        Port: MII
        PHYAD: 24
        Transceiver: internal
        Auto-negotiation: on
        Current message level: 0x00000001 (1)
        Link detected: yes
Linux# ethtool eth1
Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
        Link detected: yes

In this example, the first Ethernet interface (eth0) is attached to a half-duplex
10Mb/s network. We can see that it is capable of autonegotiation, which is a
mechanism originating with 802.3u to enable interfaces to exchange information such
as speed and capabilities such as half- or full-duplex operation. Autonegotiation
information is exchanged at the physical layer using signals sent when data is
not being transmitted or received. We can see that the second Ethernet interface
(eth1) also supports autonegotiation and has set its rate to 100Mb/s and operation
mode to full duplex. The other values (Port, PHYAD, Transceiver) identify the
physical port type, its address, and whether the physical-layer circuitry is internal
or external to the NIC. The current message-level value is used to configure log
messages associated with operating modes of the interface; its behavior is
specific to the driver being used. We discuss the wake-on values after the following
example.

In Windows, details such as these are available by navigating to Control Panel
| Network Connections and then right-clicking on the interface of interest,
selecting Properties, and then clicking the Configure box and selecting the Advanced
tab. This brings up a menu similar to the one shown in Figure 3-6 (this particular
example is from an Ethernet interface on a Windows 7 machine).


[Figure 3-6 screenshot: the Advanced tab of the "Intel(R) 82577LM Gigabit Network
Connection Properties" dialog. The Property list includes Adaptive Inter-Frame
Spacing, Enable PME, Flow Control, Gigabit Master Slave Mode, Interrupt Moderation,
Interrupt Moderation Rate, IPv4 Checksum Offload, Jumbo Packet, Large Send Offload
(IPv4), Large Send Offload (IPv6), Link Speed Battery Saver, Locally Administered
Address, and Log Link State Event. The Value drop-down offers Auto Negotiation
(selected), 1.0 Gbps Full Duplex, 10 Mbps Full Duplex, 10 Mbps Half Duplex,
100 Mbps Full Duplex, and 100 Mbps Half Duplex.]


Figure 3-6 Advanced tab of network interface properties in Windows (7). This control allows the 
user to supply operating parameters to the network device driver. 

In Figure 3-6, we can see the special features that can be configured using the
adapter's device driver. For this particular adapter and driver, 802.1p/q tags can
be enabled or disabled, as can flow control and wake-up capabilities (see Section
3.3.2). The speed and duplex can be set by hand, or to the more typical
autonegotiation option.

3.3.1 Duplex Mismatch 

Historically, there have been some interoperability problems using
autonegotiation, especially when a computer and its associated switch port are configured
using different duplex configurations or when autonegotiation is disabled at one
end of the link but not the other. In this case, a so-called duplex mismatch can occur.
Perhaps surprisingly, when this happens the connection does not completely fail
but instead may suffer significant performance degradation. When the network
has moderate to heavy traffic in both directions (e.g., during a large data
transfer), a half-duplex interface can detect incoming traffic as a collision, triggering
the exponential backoff function of the CSMA/CD Ethernet MAC. At the same
time, the data triggering the collision is lost and may require higher-layer
protocols such as TCP to retransmit. Thus, the performance degradation may be noticed
only when there is sufficient traffic for the half-duplex interface to be receiving
data at the same time it is sending, a situation that does not generally occur under
light load. Some researchers have attempted to build analysis tools to detect this
unfortunate situation [SC05].

3.3.2 Wake-on LAN (WoL), Power Saving, and Magic Packets 

In both the Linux and Windows examples, we saw some indication of power
management capabilities. In Windows the Wake-Up Capabilities and in Linux the
Wake-On options are used to bring the network interface and/or host computer out of
a lower-power (sleep) state based on the arrival of certain kinds of packets. The
kinds of packets used to trigger the change to full-power state can be configured.
In Linux, the Wake-On values are zero or more bits indicating whether
receiving the following types of frames triggers a wake-up from a low-power state: any
physical-layer (PHY) activity (p), unicast frames destined for the station (u),
multicast frames (m), broadcast frames (b), ARP frames (a), magic packet frames (g),
and magic packet frames including a password (s). These can be configured using
options to ethtool. For example, the following command can be used:


Linux# ethtool -s eth0 wol umgb


This command configures the eth0 device to signal a wake-up if any of the
frames corresponding to the types u, m, g, or b is received. Windows provides a
similar capability, but the standard user interface allows only magic packet frames
and a predefined subset of the u, m, b, and a frame types. Magic packets contain
a special repeated pattern of the byte value 0xFF. Often, such frames are sent as a
form of UDP packet (see Chapter 10) encapsulated in a broadcast Ethernet frame.
Several tools are available to generate them, including wol [WOL]:


Linux# wol 00:08:74:93:C8:3C 

Waking up 00:08:74:93:C8:3C... 


The result of this command is to construct a magic packet, which we can view 
using Wireshark (see Figure 3-7). 


[Figure 3-7 screenshot: a Wireshark capture (magic-pkt) showing one 144-byte frame,
protocol WOL, with the Info column reading "MagicPacket for DellComp_93:c8:3c
(00:08:74:93:c8:3c)". The frame is Ethernet II from LinksysG_9f:9e:80
(00:04:5a:9f:9e:80) to Intel_14:a9:c1 (00:07:e9:14:a9:c1), carrying an IPv4
datagram from 10.0.0.1 to 10.0.0.13 and a UDP datagram from port 1126 to port
40000. The Wake-on-LAN payload consists of a sync stream of 0xFF bytes followed
by 16 copies of the MAC address 00:08:74:93:c8:3c.]



Figure 3-7 A magic packet frame in Wireshark begins with 6 OxFF bytes and then repeats the MAC 
address 16 times. 


The packet shown in Figure 3-7 is mostly a conventional UDP packet, although
the port numbers (1126 and 40000) are arbitrary. The most unusual part of the
packet is the data area. It contains an initial 6 bytes with the value 0xFF. The rest
of the data area includes the destination MAC address 00:08:74:93:C8:3C repeated
16 times. This data payload pattern defines the magic packet.
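
The payload pattern just described can be sketched in a few lines of Python. This is only an illustration, not the implementation used by the wol tool: the function names are hypothetical, and the broadcast address and port 40000 are assumptions (the text notes that the port numbers are arbitrary).

```python
import socket

def build_magic_payload(mac: str) -> bytes:
    """Return the magic packet payload for a MAC such as '00:08:74:93:C8:3C'."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("a MAC address must be exactly 6 bytes")
    # 6 bytes of 0xFF followed by 16 copies of the target MAC address.
    return b"\xff" * 6 + mac_bytes * 16

def send_magic_packet(mac: str, broadcast: str = "255.255.255.255",
                      port: int = 40000) -> None:
    """Send the payload as a UDP datagram to a broadcast address.

    The address and port defaults are assumptions made for this sketch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(build_magic_payload(mac), (broadcast, port))

payload = build_magic_payload("00:08:74:93:C8:3C")
print(len(payload))  # 6 + 16 * 6 = 102 bytes
```

The 102-byte payload agrees with the 144-byte frame in Figure 3-7 once the 14-byte Ethernet header, 20-byte IPv4 header, and 8-byte UDP header are added.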
3.3.3 Link-Layer Flow Control

Operating an extended Ethernet LAN in full-duplex mode and across segments of
different speeds may require the switches to buffer (store) frames for some period
of time. This happens, for example, when multiple stations send to the same
destination (called output port contention). If the aggregate traffic rate headed for a
station exceeds the station's link rate, frames start to be stored in the intermediate
switches. If this situation persists for a long time, frames may be dropped.

One way to mitigate this situation is to apply flow control to senders (i.e., slow
them down). Some Ethernet switches (and interfaces) implement flow control by
sending special signal frames between switches and NICs. Flow control signals to
the sender that it must slow down its transmission rate, although the specification
leaves the details of this to the implementation. Ethernet uses an implementation
of flow control called PAUSE messages (also called PAUSE frames), specified by
802.3x [802.3-2008].

PAUSE messages are contained in MAC control frames, identified by the
Ethernet Length/Type field having the value 0x8808 and using the MAC control
opcode of 0x0001. A receiving station seeing this is advised to slow its rate. PAUSE
frames are always sent to the MAC address 01:80:C2:00:00:01 and are used only
on full-duplex links. They include a hold-off time value (specified in quanta equal
to 512 bit times), indicating how long the sender should pause before continuing
to transmit.

The MAC control frame is a frame format using the regular encapsulation
from Figure 3-3, but with a 2-byte opcode immediately following the Length/Type
field. PAUSE frames are essentially the only type of frames that use MAC control
frames. They include a 2-byte quantity encoding the hold-off time. Implementation
of the "entire" MAC control layer (basically, just 802.3x flow control) is optional.
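
To make the layout concrete, the following Python sketch assembles the bytes of a PAUSE frame from the fields named above and converts a hold-off value into seconds. The function names are hypothetical, and padding to 60 bytes (the 64-byte Ethernet minimum less the 4-byte FCS, which the hardware normally appends) is an assumption of this sketch.

```python
import struct

PAUSE_DST = bytes.fromhex("0180c2000001")  # 01:80:C2:00:00:01
MAC_CONTROL_TYPE = 0x8808                  # Length/Type value for MAC control
PAUSE_OPCODE = 0x0001                      # MAC control opcode for PAUSE

def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Build a PAUSE frame (without FCS) asking the peer to hold off."""
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("hold-off time must fit in 2 bytes")
    header = PAUSE_DST + src_mac + struct.pack("!H", MAC_CONTROL_TYPE)
    body = struct.pack("!HH", PAUSE_OPCODE, quanta)
    # Pad with zeros up to the minimum frame size (assumed 60 bytes before FCS).
    return (header + body).ljust(60, b"\x00")

def hold_off_seconds(quanta: int, link_bps: float) -> float:
    """One quantum is 512 bit times, so the pause scales with link speed."""
    return quanta * 512 / link_bps

frame = build_pause_frame(bytes.fromhex("00045a9f9e80"), 0xFFFF)
print(frame[:18].hex())
print(hold_off_seconds(0xFFFF, 1_000_000_000))  # longest pause on a 1000Mb/s link
```

Because the quantum is defined in bit times, the same 2-byte value represents a pause one hundred times longer on a 10Mb/s link than on a 1000Mb/s link.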

Using Ethernet-layer flow control may have a significant negative side effect,
and for this reason it is typically not used. When multiple stations are sending
through a switch (see the next section) that is becoming overloaded, the switch
may naturally send PAUSE frames to all hosts. Unfortunately, the utilization of
the switch's memory may not be symmetric with respect to the sending hosts, so
some may be penalized (flow-controlled) even though they were not responsible
for much of the traffic passing through the switch.


3.4 Bridges and Switches 

The IEEE 802.1d standard specifies the operation of bridges, and thus switches,
which are essentially high-performance bridges. A bridge or switch is used to join
multiple physical link-layer networks (e.g., a pair of physical Ethernet segments) or
groups of stations. The most basic setup involves connecting two switches to form
an extended LAN, as shown in Figure 3-8.

Figure 3-8 A simple extended Ethernet LAN with two switches. Each switch port has a number for
reference, and each station (including each switch) has its own MAC address.


Switches A and B in the figure have been interconnected to form an extended 
LAN. In this particular example, client systems are connected to A and servers 
to B, and ports are numbered for reference. Note that every network element, 
including each switch, has its own MAC address. Nonlocal MAC addresses are 
"learned" by each bridge over time so that eventually every switch knows the port 
upon which every station can be reached. These lists are stored in tables (called 
filtering databases) within each switch on a per-port (and possibly per-VLAN) basis. 
As an example, after each switch has learned the location of every station, these
databases would contain the information shown in Figure 3-9.

[Figure 3-9 panels: Switch A's Database; Switch B's Database]


Figure 3-9 Filtering databases on switches A and B from Figure 3-8 are created over time ("learned") 
by observing the source address on frames seen on switch ports. 


When a switch (bridge) is first turned on, its database is empty, so it does
not know the location of any stations except itself. Whenever it receives a frame
destined for a station other than itself, it makes a copy for each of the ports other
than the one on which the frame arrived and sends a copy of the frame out of each
one. If switches (bridges) never learned the location of stations, every frame would
be delivered across every network segment, leading to unwanted overhead. The
learning capability reduces overhead significantly and is a standard feature of
switches and bridges.
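
The learning behavior can be illustrated with a small Python model. It is a simplified sketch rather than the logic of any particular switch: the class and method names are hypothetical, and broadcast and multicast destinations, VLANs, and entry aging are ignored.

```python
class LearningBridge:
    """Toy model of a learning bridge with a filtering database."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}  # filtering database: MAC address -> port

    def receive(self, in_port, src, dst):
        """Learn the source address, then return the list of output ports."""
        # Learning: the source is reachable through the arrival port.
        self.fdb[src] = in_port
        out_port = self.fdb.get(dst)
        if out_port is None:
            # Unknown destination: copy to every port except the arrival port.
            return sorted(self.ports - {in_port})
        if out_port == in_port:
            # Destination is on the arrival segment: nothing to forward.
            return []
        return [out_port]

bridge = LearningBridge([1, 2, 3])
print(bridge.receive(1, "S", "D"))  # D unknown, so the frame goes out ports 2 and 3
print(bridge.receive(2, "D", "S"))  # S was learned on port 1
print(bridge.receive(1, "S", "D"))  # D was learned on port 2
```

After the first exchange, traffic between S and D touches only the two ports involved, which is the overhead reduction the text describes.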

Today, most operating systems support the capability to bridge between network
interfaces, meaning that a standard computer with multiple interfaces can
be used as a bridge. In Windows, for example, interfaces may be bridged together
by navigating to the Network Connections menu from the Control Panel, highlighting
the interfaces to bridge, right-clicking the mouse, and selecting Bridge
Connections. When this is done, a new icon appears that represents the bridging
function itself. Most of the normal network properties associated with the interfaces
are gone and instead appear on the bridge device (see Figure 3-10).


[Network Bridge Properties dialog, Networking tab. Adapters: Local Area Connection
and Wireless Network Connection entries. Items used by this connection: Client for
Microsoft Networks, QoS Packet Scheduler, File and Printer Sharing for Microsoft
Networks, Internet Protocol Version 6 (TCP/IPv6), Internet Protocol Version 4
(TCP/IPv4), Link-Layer Topology Discovery Mapper I/O Driver, Link-Layer Topology
Discovery Responder.]


Figure 3-10 In Windows, the bridge device is created by highlighting the network interfaces to be 
bridged, right-clicking, and selecting the Bridge Network Interfaces function. Once the 
bridge is established, further modifications are made to the bridge device. 


Figure 3-10 shows the Properties panels for the network bridge virtual device
on Windows 7. The bridge device's properties include a list of the underlying
devices being bridged and the set of services running on the bridge (e.g., the
Microsoft Networks client, File and Printer Sharing, etc.). Linux works in a similar
way, using command-line arguments. We use the topology shown in Figure 3-11
for this example.

[Figure 3-11 labels: eth1: 00:07:e9:14:a9:c1; eth0: 00:08:74:93:c8:3c; 00:04:5a:9f:9e:80]



Figure 3-11 In this simple topology, a Linux-based PC is configured to operate as a bridge between 
the two Ethernet segments it interconnects. As a learning bridge, it accumulates tables 
of which port should be used to reach the various other systems on the extended LAN. 


The simple network in Figure 3-11 uses a Linux-based PC with two Ethernet 
ports as a bridge. Attached to port 2 is a single station, and the rest of the network 
is attached to port 1. The following commands enable the bridge: 


Linux# brctl addbr br0
Linux# brctl addif br0 eth0
Linux# brctl addif br0 eth1
Linux# ifconfig eth0 up
Linux# ifconfig eth1 up
Linux# ifconfig br0 up


This series of commands creates a bridge device br0 and adds the interfaces
eth0 and eth1 to the bridge. Interfaces can be removed using the brctl delif
command. Once the interfaces are established, the brctl showmacs command
can be used to inspect the filter databases (called forwarding databases or fdbs in
Linux terminology):


Linux# brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.0007e914a9c1       no              eth0 eth1
Linux# brctl showmacs br0
port no mac addr                is local?       ageing timer
  1     00:04:5a:9f:9e:80       no                 0.79
  2     00:07:e9:14:a9:c1       yes                0.00
  1     00:08:74:93:c8:3c       yes                0.00
  2     00:14:22:f4:19:5f       no                 0.81
  1     00:17:f2:e7:6d:91       no                 2.53
  1     00:90:f8:00:90:b7       no                17.13

The output of this command reveals one other detail about bridges. Because
stations may move around, have their network cards replaced, have their MAC
address changed, or undergo other changes, once the bridge discovers that a MAC
address is reachable via a certain port, this information cannot be assumed to be
correct forever. To deal with this issue, each time an address is learned, a timer is
started (commonly defaulted to 5 minutes). In Linux, a fixed amount of time associated
with the bridge is applied to each learned entry. If the address in the entry is not
seen again within the specified "ageing" time, the entry is removed, as indicated
here:


Linux# brctl setageing br0 1
Linux# brctl showmacs br0
port no mac addr                is local?       ageing timer
  1     00:04:5a:9f:9e:80       no                 0.76
  2     00:07:e9:14:a9:c1       yes                0.00
  1     00:08:74:93:c8:3c       yes                0.00
  2     00:14:22:f4:19:5f       no                 0.78
  1     00:17:f2:e7:6d:91       no                 0.00


Here, we have set the ageing value unusually low for demonstration purposes.
When an entry is removed because of aging, subsequent frames for the
removed destination are once again sent out of every port except the receiving one
(called flooding), and the entry is placed anew into the filtering database. The use
of filtering databases and learning is really a performance optimization: if the
tables are empty, the network experiences more overhead but still functions. Next
we turn our attention to the case where more than two bridges are interconnected
with redundant links. In this situation, flooding of frames could lead to a sort of
flooding catastrophe with frames looping forever. Obviously, we require a way of
dealing with this problem.
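
The aging rule for learned entries can be sketched as follows. This is a hypothetical model, not the Linux bridge code: the class name is invented, timestamps are passed in explicitly so that the behavior is easy to follow, and expired entries are removed lazily on lookup rather than by a periodic timer.

```python
class AgingTable:
    """Filtering database whose entries expire if not refreshed."""

    def __init__(self, ageing_time=300.0):
        self.ageing_time = ageing_time  # seconds; 5 minutes is a common default
        self.entries = {}               # MAC address -> (port, time last seen)

    def learn(self, mac, port, now):
        """Record (or refresh) the port on which a source address was seen."""
        self.entries[mac] = (port, now)

    def lookup(self, mac, now):
        """Return the port for mac, or None if unknown or aged out."""
        entry = self.entries.get(mac)
        if entry is None:
            return None
        port, last_seen = entry
        if now - last_seen > self.ageing_time:
            # Entry is stale: drop it so the next frame is flooded and relearned.
            del self.entries[mac]
            return None
        return port

table = AgingTable(ageing_time=1.0)  # unusually low, as in the brctl example
table.learn("00:17:f2:e7:6d:91", 1, now=0.0)
print(table.lookup("00:17:f2:e7:6d:91", now=0.5))  # still fresh
print(table.lookup("00:17:f2:e7:6d:91", now=2.0))  # aged out
```

A lookup that returns None corresponds to the flooding case: the bridge no longer knows where the station is and must send the frame out of every other port.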


3.4.1 Spanning Tree Protocol (STP) 

Bridges may operate in isolation, or in combination with other bridges. When more
than two bridges are in use (or in general when switch ports are cross-connected),
the possibility exists for a cascading, looping set of frames to be formed. Consider
the network shown in Figure 3-12.

Assume that the switches in Figure 3-12 have just been turned on and their
filtering databases are empty. When station S sends a frame, switch B replicates
the frame on ports 7, 8, and 9. So far, the initial frame has been "amplified" three
times. These frames are received by switches A, D, and C. Switch A produces copies
of the frame on ports 2 and 3. Switches D and C produce more copies on ports
20, 22 and 13, 14, respectively. The amplification factor has grown to 6, with copies
of the frames traveling in both directions among switches A, C, and D. Once these
frames arrive, the forwarding databases begin to oscillate as the bridge attempts to
figure out which port is really the one through which station S should be reached.
Obviously, this situation is intolerable. If it were allowed to occur, bridges used in
such configurations would be useless. Fortunately, a protocol called the Spanning
Tree Protocol (STP) is used to avoid this situation. We describe STP in
some detail to explain why some approach to duplicate suppression is needed for
bridges and switches. In the current standard [802.1D-2004], conventional STP is
replaced with the Rapid Spanning Tree Protocol (RSTP), which we describe after the
conventional STP preliminaries.

Figure 3-12 An extended Ethernet network with four switches and multiple redundant links. If
simple flooding were used in forwarding frames through this network, a catastrophe
would occur because of excess multiplying traffic (a so-called broadcast storm). This
type of situation requires the use of the STP.

STP works by disabling certain ports at each bridge so that topological loops
are avoided (i.e., no duplicate paths between bridges are permitted), yet the topology
is not partitioned: all stations can be reached. Mathematically, a spanning
tree is a collection of all of the nodes and some of the edges of a graph such that
there is a path or route from any node to any other node (spanning the graph), but
there are no loops (the edge set forms a tree). There can be many spanning trees on
a graph. STP finds one of them for the graph formed by bridges as nodes and links
as edges. Figure 3-13 illustrates the idea.



Figure 3-13 Using STP, the B-A, A-C, and C-D links have become active on the spanning tree. Ports
6, 7, 1, 2, 13, 14, and 20 are in the forwarding state; all other ports are blocked (i.e., not
forwarding). This keeps frames from looping and avoids broadcast storms. If a configuration
change occurs or a switch fails, the blocked ports are changed to the forwarding
state and the bridges compute a new spanning tree.

In this figure, the dark lines represent the links in the network selected by STP
for forwarding frames. None of the other links are used; ports 8, 9, 12, 21, 22, and
3 are blocked. With STP, the various problems raised earlier do not occur, as frames
are created only as the result of another frame arriving. There is no amplification.
Furthermore, looping is avoided because there is only one path between any two
stations. The spanning tree is formed and maintained by bridges using a distributed
algorithm running in each bridge.

As with forwarding databases, STP must deal with the situation where bridges 
are turned off and on, interface cards are replaced, or MAC addresses are changed. 
Clearly, such changes could affect the operation of the spanning tree, so the STP 
adapts to these changes. The adaptation is implemented using an exchange of 
special frames called Bridge Protocol Data Units (BPDUs). These frames are used 
for forming and maintaining the spanning tree. The tree is "grown" from a bridge 
elected by the others and known as the "root bridge." 

As mentioned previously, there are many possible spanning trees for a given
network. Determining which one might be the best to use for forwarding frames
depends on a set of costs that can be associated with each link and the location of
the root bridge. Costs are simply integers that are (recommended to be) inversely
proportional to the link speeds. For example, a 10Mb/s link has a recommended
cost of 100, and 100Mb/s and 1000Mb/s links have recommended cost values of 19
and 4, respectively. STP operates by computing least-cost paths to the root bridge
using these costs. If multiple links must be traversed, the corresponding cost is
simply the sum of the link costs.
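
As a small worked example of the cost rule, the Python sketch below sums the recommended link costs quoted above along a path and picks the cheaper of two candidate paths to the root. The table and function names are hypothetical, and the candidate paths are invented for illustration.

```python
# Recommended costs from the text, keyed by link speed in Mb/s.
RECOMMENDED_COST = {10: 100, 100: 19, 1000: 4}

def path_cost(link_speeds_mbps):
    """Cost of a path is the sum of the costs of the links it traverses."""
    return sum(RECOMMENDED_COST[speed] for speed in link_speeds_mbps)

# Two hypothetical ways to reach the root bridge.
direct = [100]                  # one 100Mb/s link
detour = [1000, 1000, 1000]     # three 1000Mb/s links

best = min([direct, detour], key=path_cost)
print(path_cost(direct), path_cost(detour), best)
```

Here the three-hop path over 1000Mb/s links costs 12 and is preferred to the single 100Mb/s link, which costs 19. Hop count alone does not decide the outcome.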

3.4.1.1 Port States and Roles 

To understand the basic operation of STP, we need to understand the operation of 
the state machine for each port at each bridge, as well as the contents of BPDUs. 
Each port in each bridge may be in one of five states: blocking, listening, learning, 
forwarding, and disabled. The relationship among them can be seen in the state 
transition diagram shown in Figure 3-14. 

The normal transitions for ports on the spanning tree are indicated in Figure 
3-14 by solid arrows, and the smaller arrows with dashed lines indicate changes 
due to administrative configuration. After initialization, a port enters the blocking 
state. In this state, it does not learn addresses, forward frames, or transmit BPDUs, 
but it does monitor received BPDUs in case it needs to be included in the future on 
a path to the root bridge, in which case the port transitions to the listening state. In 
the listening state, the port is now permitted to send as well as receive BPDUs but 
not learn addresses or forward data. After a typical forwarding delay timeout of 
15s, a port enters the learning state. Here it is permitted to do all procedures except 
forward data. It waits another forwarding delay before entering the forwarding 
state and commencing to forward frames. 
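
The permissions and timing described above can be summarized in a short Python sketch. The names are hypothetical, the 15s forwarding delay is the typical value quoted in the text, and the function assumes the port has only just entered the given state.

```python
FORWARD_DELAY = 15.0  # seconds; typical value from the text

# What a port is permitted to do in each state.
ALLOWED = {
    "blocking":   {"receive_bpdu"},
    "listening":  {"receive_bpdu", "send_bpdu"},
    "learning":   {"receive_bpdu", "send_bpdu", "learn_addresses"},
    "forwarding": {"receive_bpdu", "send_bpdu", "learn_addresses",
                   "forward_frames"},
}

# States that are left only when a forwarding delay timer expires.
TIMED_STATES = ["listening", "learning"]

def delay_until_forwarding(state, forward_delay=FORWARD_DELAY):
    """Seconds before a port that has just entered `state` forwards frames."""
    if state == "forwarding":
        return 0.0
    if state not in TIMED_STATES:
        # Leaving the blocking state is driven by received BPDUs, not a timer.
        raise ValueError("no fixed delay from state: " + state)
    remaining = TIMED_STATES[TIMED_STATES.index(state):]
    return forward_delay * len(remaining)

print(delay_until_forwarding("listening"))  # two forwarding delays
print(delay_until_forwarding("learning"))   # one forwarding delay
```

With the typical timer value, a port selected for the tree therefore waits about 30s before carrying data, which helps explain the motivation for RSTP.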

Related to the port state machine, each port is said to have a role. This terminology
becomes more important with RSTP (see Section 3.4.1.6). A port may have
the role of root port, designated port, alternate port, or backup port. Root ports are those
ports at the end of an edge on the spanning tree headed toward the root. Designated
ports are ports in the forwarding state acting as the port on the least-cost
path to the root from the attached segment. Alternate ports are other ports on an
attached segment that could also reach the root but at higher cost. They are not in
the forwarding state. A backup port is a port connected to the same segment as a
designated port on the same bridge. Thus, backup ports could easily take over for
a failing designated port without disrupting any of the rest of the spanning tree
topology but do not offer an alternate path to the root should the entire bridge fail.

Figure 3-14 Ports transition among four major states in normal STP operation. In the blocking state,
frames are not forwarded, but a topology change or timeout may cause a transition to
the listening state. The forwarding state is the normal state for active switch ports carrying
data traffic. The state names in parentheses indicate the port states according to
the RSTP.

3.4.1.2 BPDU Structure 

To determine the links in the spanning tree, STP uses BPDUs that adhere to the
format shown in Figure 3-15.

The format shown in Figure 3-15 applies to both the original STP and
the newer RSTP (see Section 3.4.1.6). BPDUs are always sent to the group address
01:80:C2:00:00:00 (see Chapter 9 for details of link-layer group and Internet multicast
addressing) and are not forwarded through a bridge without modification. In
the figure, the DST, SRC, and L/T (Length/Type) fields are part of the conventional
Ethernet (802.3) header of the frame carrying the example BPDU. The 3-byte LLC/
SNAP header is defined by 802.1 and for BPDUs is set to the constant 0x424203.
Not all BPDUs are encapsulated using LLC/SNAP, but this is a common option.

Figure 3-15 BPDUs are carried in the payload area of 802 frames and exchanged between bridges to establish
the spanning tree. Important fields include the source, root node, cost to root, and topology
change indication. With 802.1w and [802.1D-2004] (including Rapid STP or RSTP), additional
fields indicate the state of the ports.


The Protocol (Prot) field gives the protocol ID number, set to 0. The Version
(Vers) field is set to 0 or 2, depending on whether STP or RSTP is in use. The Type
field is assigned similarly. The Flags field contains Topology Change (TC) and Topology
Change Acknowledgment (TCA) bits, defined by the original 802.1d standard.
Additional bits are defined for Proposal (P), Port Role (00, unknown; 01, alternate;
10, root; 11, designated), Learning (L), Forwarding (F), and Agreement (A). These are
discussed in the context of RSTP in Section 3.4.1.6. The Root ID field gives the identifier
of the root bridge in the eyes of the sender of the frame, whose MAC address
is given in the Bridge ID field. Both of these ID fields are encoded in a special way
that includes a 2-byte Priority field immediately preceding the MAC address. The
priority values can be manipulated by management software in order to force the
spanning tree to be rooted at any particular bridge (Cisco, for example, uses a
default value of 0x8000 in its Catalyst switches).

The root path cost is the computed cost to reach the bridge specified in the
Root ID field. The PID field is the port identifier and gives the number of the port
from which the frame was sent appended to a 1-byte configurable Priority field
(default 0x80). The Message Age (MsgA) field gives the message age (see the next
paragraph). The Maximum Age (MaxA) field gives the maximum age before timeout
(default: 20s). The Hello Time field gives the time between periodic transmissions
of configuration frames. The Forward Delay (Forw Delay) field gives the time
spent in the learning and listening states. All of the age and time fields are given
in units of 1/256s.

The Message Age field is not a fixed value like the other time-related fields.
When the root bridge sends a BPDU, it sets this field to 0. Any bridge receiving the
frame emits frames on its non-root ports with the Message Age field incremented by
1. In essence, the field acts as a hop count, giving the number of bridges by which
the BPDU has been processed before being received. When a BPDU is received on
a port, the information it contains is kept in memory and participates in the STP
algorithm until it is timed out, which happens at time (MaxA - MsgA). Should
this time pass on a root port without receipt of another BPDU, the root bridge is
declared "dead" and the bridge starts the root bridge election process over again.
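
The field encodings described in this section can be sketched in Python. The helper names are hypothetical; the sketch assumes the usual network (big-endian) byte order for multi-byte fields and covers only the identifier and time fields, not a complete BPDU.

```python
import struct

def encode_bridge_id(priority: int, mac: str) -> bytes:
    """2-byte Priority field immediately followed by the 6-byte MAC address."""
    return struct.pack("!H", priority) + bytes.fromhex(mac.replace(":", ""))

def encode_port_id(priority: int, port_number: int) -> bytes:
    """1-byte Priority field followed by the port number."""
    return struct.pack("!BB", priority, port_number)

def encode_time(seconds: float) -> bytes:
    """Age and time fields are carried in units of 1/256 second."""
    return struct.pack("!H", round(seconds * 256))

def decode_time(raw: bytes) -> float:
    return struct.unpack("!H", raw)[0] / 256

def remaining_lifetime(max_age: float, message_age: float) -> float:
    """Received information is timed out after (MaxA - MsgA)."""
    return max_age - message_age

print(encode_bridge_id(0x8000, "00:07:e9:14:a9:c1").hex())
print(encode_port_id(0x80, 2).hex())
print(encode_time(20).hex(), decode_time(bytes.fromhex("0f00")))
```

The first value printed is the same 8-byte identifier that brctl displays as 8000.0007e914a9c1 for the example bridge, with the priority occupying the leading two bytes.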

3.4.1.3 Building the Spanning Tree 

The first job of STP is to elect the root bridge. The root bridge is discovered as
the bridge in the network (or VLAN) with the smallest identifier (priority combined
with MAC address). When a bridge initializes, it assumes itself to be the
root bridge and sends configuration BPDUs with the Root ID field matching its
own bridge ID. If it detects a bridge with a smaller ID, it ceases sending its own
frames and instead adopts the frame it received containing the smaller ID to be the
basis for further BPDUs it sends. The port where the BPDU with the smaller root
ID was received is then marked as the root port (i.e., the port on the path to the root
bridge). The remaining ports are placed in either blocked or forwarding states.
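
The election rule amounts to taking the minimum over (priority, MAC address) pairs, as the following Python sketch shows. It models only the outcome, not the BPDU exchange that reaches it; the function names are hypothetical, and the MAC addresses are borrowed from earlier examples purely for illustration.

```python
def bridge_id(priority, mac):
    """Comparable identifier: priority first, then the MAC address bytes."""
    return (priority, bytes.fromhex(mac.replace(":", "")))

def elect_root(bridges):
    """Return the (priority, mac) pair with the smallest identifier."""
    return min(bridges, key=lambda b: bridge_id(*b))

bridges = [
    (0x8000, "00:08:74:93:c8:3c"),
    (0x8000, "00:07:e9:14:a9:c1"),
    (0x8000, "00:90:f8:00:90:b7"),
]
# With equal priorities, the smallest MAC address wins.
print(elect_root(bridges))

# Lowering one bridge's priority value forces it to become the root.
bridges[2] = (0x4000, "00:90:f8:00:90:b7")
print(elect_root(bridges))
```

Because the priority is compared before the MAC address, an administrator can choose the root by configuration instead of leaving the result to whichever device happens to have the lowest address.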

3.4.1.4 Topology Changes 

The next important job of STP is to handle topology changes. Although we could
conceivably use the basic database aging mechanism described earlier to adapt to
changing topologies, this is a poor approach because the aging timers can take a
long time (5 minutes) to delete incorrect entries. Instead, STP incorporates a way
to detect topology changes and inform the network about them quickly. In STP, a
topology change occurs when a port has entered the blocking or forwarding states.
When a bridge detects a connectivity change (e.g., a link goes down), the bridge notifies
its parent bridges on the tree to the root by sending topology change notification
(TCN) BPDUs out of its root port. The next bridge on the tree to the root acknowledges
the TCN BPDUs to the notifying bridge and also forwards them on toward
the root. Once informed of the topology change, the root bridge sets the TC bit field
in subsequent periodic configuration messages. Such messages are relayed by every
bridge in the network and are received by ports in either the blocking or forwarding
states. The setting of this bit field allows bridges to reduce their aging time to that of
the forward delay timer, on the order of seconds instead of the 5 minutes normally
recommended for the aging time. This allows database entries that may now be
incorrect to be purged and relearned more quickly, yet it also allows stations that
are actively communicating to not have their entries deleted erroneously.

3.4.1.5 Example 

In Linux, the bridge function disables STP by default, on the assumption that
topologies are relatively simple in most cases where a regular computer is being
used as a bridge. To enable STP on the example bridge we have been using so far,
we can do the following:


Linux# brctl stp br0 on


The consequences of executing this command can be inspected as follows: 


Linux# brctl showstp br0
br0
 bridge id              8000.0007e914a9c1
 designated root        8000.0007e914a9c1
 root port                 0                    path cost                  0
 max age                  19.99                 bridge max age            19.99
 hello time                1.99                 bridge hello time          1.99
 forward delay            14.99                 bridge forward delay      14.99
 ageing time               0.99
 hello timer               1.26                 tcn timer                  0.00
 topology change timer     3.37                 gc timer                   3.26
 flags                  TOPOLOGY_CHANGE TOPOLOGY_CHANGE_DETECTED

eth0 (0)
 port id                0000                    state                forwarding
 designated root        8000.0007e914a9c1       path cost                100
 designated bridge      8000.0007e914a9c1       message age timer          0.00
 designated port        8001                    forward delay timer        0.00
 designated cost           0                    hold timer                 0.26
 flags

eth1 (0)
 port id                0000                    state                forwarding
 designated root        8000.0007e914a9c1       path cost                 19
 designated bridge      8000.0007e914a9c1       message age timer          0.00
 designated port        8002                    forward delay timer        0.00
 designated cost           0                    hold timer                 0.26
 flags


Here we can see the STP setup for a simple bridged network. The bridge
device, br0, holds information for the bridge as a whole. This includes the bridge
ID (8000.0007e914a9c1), derived from the smallest MAC address on the PC-based
bridge (port 1) of Figure 3-11. The major configuration parameters (e.g., hello
time, topology change timer, etc.) are given in seconds. The flags values indicate
a recent topology change, which is expected given the fact that the network was
recently connected. The rest of the output describes per-port information for eth0
(bridge port 1) and eth1 (bridge port 2). Note that the path cost for eth0 (100) is
roughly five times the cost of eth1 (19). This is consistent with the recommended
costs, given that eth0 is a 10Mb/s Ethernet network and eth1 is a full-duplex
100Mb/s network.

We can use Wireshark to look at a BPDU. In Figure 3-16 we see the contents
of a 52-byte BPDU. The length of 52 bytes (less than the Ethernet minimum of 64
bytes because the Linux capture facility removed the padding) is derived from
the Length/Type field of the Ethernet header by adding 14, in this case giving the
length of 52. The destination address is the group address, 01:80:C2:00:00:00, as
expected. The payload length is 38 bytes, the value contained in the Length field.
The SNAP/LLC field contains the constant 0x424203, and the encapsulated frame
is a spanning tree (version 0) frame. The rest of the protocol fields indicate that the
station 00:07:e9:14:a9:c1 believes it is the root of the spanning tree, using priority
32768 (a low priority), and the BPDU has been sent from port 2 with priority 0x80.
It also indicates a maximum age of 20s, a hello time of 2s, and a forwarding delay
of 15s.


[Wireshark capture of four STP configuration BPDUs sent about 2s apart from
00:07:e9:14:a9:c1 to 01:80:c2:00:00:00, each with Root = 32768/0/00:07:e9:14:a9:c1,
Cost = 0, Port = 0x8002. Decode of frame 1 (52 bytes on wire):
IEEE 802.3 Ethernet: Destination 01:80:c2:00:00:00 (group address), Source
00:07:e9:14:a9:c1, Length 38.
Logical-Link Control: DSAP 0x42 (Spanning Tree BPDU), SSAP 0x42, Control field
0x03 (unnumbered information).
Spanning Tree Protocol: Protocol Identifier 0x0000, Protocol Version 0, BPDU Type
Configuration (0x00), BPDU flags 0x00, Root Identifier 32768 / 0 / 00:07:e9:14:a9:c1,
Root Path Cost 0, Bridge Identifier 32768 / 0 / 00:07:e9:14:a9:c1, Port Identifier
0x8002, Message Age 0, Max Age 20, Hello Time 2, Forward Delay 15.]

Figure 3-16 Wireshark showing a BPDU. The Ethernet destination is a group address for bridges
(01:80:c2:00:00:00).

3.4.1.6 Rapid Spanning Tree Protocol (RSTP) (Formerly 802.1w)

One of the perceived problems with conventional STP is that a change in topology 
is detected only by the failure to receive a BPDU in a certain amount of time. If 
the timeout is large, the convergence time (time to reestablish data flow along the 
spanning tree) could be larger than desired. The IEEE 802.1w standard (now part 
of [802.1D-2004]) specifies enhancements to the conventional STP and adopts the 
new name Rapid Spanning Tree Protocol (RSTP). The main improvement in RSTP 
over STP is to monitor the status of each port and upon indication of failure to 
immediately trigger a topology change indication. In addition, RSTP uses all 6 bits 
in the Flag field of the BPDU format to support agreements between bridges that 
avoid some of the need for timers to initiate protocol operations. It reduces the 
normal STP five port states to three (discarding, learning, and forwarding, as 
indicated by the state names in parentheses in Figure 3-14). The discarding state
in RSTP absorbs the disabled, blocking, and listening states in conventional STP.
RSTP also creates a new port role called an alternate port, which acts as an immediate
backup should a root port cease to operate.
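
The collapse of five STP port states into three RSTP states can be written down as a simple lookup table. The following is only a sketch of the mapping described above; the dictionary and function names are our own and do not come from any standard or product.

```python
# Mapping of conventional STP port states to the RSTP states that absorb them,
# as described in the text. The names used here are illustrative only.
STP_TO_RSTP = {
    "disabled": "discarding",
    "blocking": "discarding",
    "listening": "discarding",
    "learning": "learning",
    "forwarding": "forwarding",
}


def rstp_state(stp_state: str) -> str:
    """Return the RSTP port state corresponding to a conventional STP state."""
    return STP_TO_RSTP[stp_state.lower()]


if __name__ == "__main__":
    for name in STP_TO_RSTP:
        print(f"{name:10s} -> {rstp_state(name)}")
```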

RSTP uses only one type of BPDU, so there are no special topology change 
BPDUs, for example. RSTP BPDUs, as they are called, use version and type number
2 instead of 0. In RSTP, any switch detecting a topology change sends BPDUs
indicating a topology change, and any switch receiving them clears its filtering
databases immediately. This change can significantly affect the protocol's convergence
time. Instead of waiting for the topology change to migrate to the root
bridge and back followed by the forwarding delay wait time, entries are cleared 
immediately. Overall, convergence time can be cut from tens of seconds down to a 
fraction of a second in most cases. 

RSTP makes a distinction between edge ports (those attached only to end stations)
and normal spanning tree ports and also between point-to-point links and
shared links. Edge ports and ports on point-to-point links do not ordinarily form 
loops, so they are permitted to skip the listening and learning states and move 
directly to the forwarding state. Of course, the assumption of being an edge port 
could be violated if, for example, two ports were cross-connected, but this is handled
by reclassifying ports as spanning tree ports if they ever carry any form of
BPDUs (simple end stations do not normally generate BPDUs). Point-to-point links 
are inferred from the operating mode of the interface; if the interface is running in 
full-duplex mode, the link is classified as a point-to-point link. 

In regular STP, BPDUs are ordinarily relayed from a notifying or root bridge. 
In RSTP, BPDUs are sent periodically by all bridges as "keepalives" to determine 
if connections to neighbors are operating properly. This is what most higher-layer 
routing protocols do also. If a bridge fails to receive an updated BPDU within 
three times the hello interval, the bridge concludes that it has lost its connection 
with its neighbor. Note that in RSTP, topology changes are not induced as a result 
of edge ports being connected or disconnected as they are in regular STP. When 
a topology change is detected, the notifying bridge sends BPDUs with the TC bit 
field set, not only to the root but also to all other bridges. Doing so allows the
entire network to be notified of the topology change much faster than with conventional
STP. When a bridge receives these messages, it flushes all table entries
except those associated with edge ports and restarts the learning process. 
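
The keepalive rule mentioned above (a neighbor is presumed lost after three hello intervals without a BPDU) amounts to a single comparison. The sketch below is an illustration under simplifying assumptions: the function name and the use of plain floating-point seconds are ours, and the default hello time of 2s is taken from the Hello Time shown in Figure 3-16.

```python
def neighbor_lost(last_bpdu_time: float, now: float, hello_time: float = 2.0) -> bool:
    """Return True if no BPDU has been received within three hello intervals.

    All times are in seconds. This is a simplified model of the RSTP
    keepalive check, not an implementation of the standard's state machines.
    """
    return (now - last_bpdu_time) > 3 * hello_time


if __name__ == "__main__":
    print(neighbor_lost(last_bpdu_time=0.0, now=5.0))  # still within 3 x 2s
    print(neighbor_lost(last_bpdu_time=0.0, now=7.5))  # beyond 3 x 2s
```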

Many of RSTP's features were developed by Cisco Systems and other companies
that had for some time provided proprietary enhancements to regular STP in
their products. The IEEE committee incorporated many of these enhancements into 
the updated 802.1d standard, which covers both types of STP, so extended LANs 
can run regular STP on some segments and RSTP on others (although the RSTP 
benefits are lost). RSTP has been extended to include VLANs [802.1Q-2005]—a 
protocol called the Multiple Spanning Tree Protocol (MSTP). This protocol retains 
the RSTP (and hence STP) BPDU format, so backward compatibility is possible, 
but it also supports the formation of multiple spanning trees (one for each VLAN). 

3.4.2 802.1ak: Multiple Registration Protocol (MRP)

The Multiple Registration Protocol (MRP) provides a general method for registering 
attributes among stations in a bridged LAN environment. [802.1ak-2007] defines 
two particular "applications" of MRP called MVRP (for registering VLANs) and 
MMRP (for registering group MAC addresses). MRP replaces the earlier GARP
framework; MVRP and MMRP replace the older GVRP and GMRP protocols, 
respectively. All were originally defined by 802.1q. 

With MVRP, once an end station is configured as a member of a VLAN, this 
information is communicated to its attached switch, which in turn propagates 
the fact of the station's participation in the VLAN to other switches. This allows 
switches to augment their filtering tables based on station VLAN IDs and allows 
changes of VLAN topology without necessarily triggering a recalculation of the 
existing spanning tree via STP. Avoiding STP recalculation was one of the reasons 
for migrating from GVRP to MVRP. 

MMRP is a method for stations to register their interest in group MAC
addresses (multicast addresses). This information may be used by switches to
establish the ports through which multicast traffic must be delivered. Without
such a facility, switches would have to broadcast all multicast traffic, potentially
leading to unwanted overhead. MMRP is a layer 2 protocol with similarities to
IGMP and MLD, layer 3 protocols, and the "IGMP/MLD snooping" capability supported
in many switches. We discuss IGMP, MLD, and snooping in Chapter 9.


3.5 Wireless LANs—IEEE 802.11 (Wi-Fi) 

One of the most popular technologies being used to access the Internet today is
wireless fidelity (Wi-Fi), also known by its IEEE standard name 802.11, effectively
a wireless version of Ethernet. Wi-Fi has developed to become an inexpensive,
highly convenient way to provide connectivity and performance levels acceptable
for most applications. Wi-Fi networks are easy to set up, and most portable computers
and smartphones now include the necessary hardware to access Wi-Fi
infrastructure. Many coffee shops, airports, hotels, and other facilities include
Wi-Fi "hot spots," and Wi-Fi is even seeing considerable advancement in developing
countries where other infrastructure may be difficult to obtain. The architecture
of an IEEE 802.11 network is shown in Figure 3-17.





Figure 3-17 The IEEE 802.11 terminology for a wireless LAN. Access points (APs) can be connected 
using a distribution service (DS, a wireless or wired backbone) to form an extended 
WLAN (called an ESS). Stations include both APs and mobile devices communicating 
together that form a basic service set (BSS). Typically, an ESS has an assigned ESSID that 
functions as a name for the network. 


The network in Figure 3-17 includes a number of stations (STAs). Typically
stations are organized with a subset operating also as access points (APs). An AP 
and its associated stations are called a basic service set (BSS). The APs are generally 
connected to each other using a wired distribution service (called a DS, basically a 
"backbone"), forming an extended service set (ESS). This setup is commonly termed 
infrastructure mode. The 802.11 standard also provides for an ad hoc mode. In this 
configuration there is no AP or DS; instead, direct station-to-station (peer-to-peer) 
communication takes place. In IEEE terminology, the STAs participating in an 
ad hoc network form an independent basic service set (IBSS). A WLAN formed from 
a collection of BSSs and/or IBSSs is called a service set, identified by a service set 
identifier (SSID). An extended service set identifier (ESSID) is an SSID that names a 
collection of connected BSSs and is essentially a name for the LAN that can be up 
to 32 characters long. Such names are ordinarily assigned to Wi-Fi APs when a
WLAN is first installed. 
3.5.1 802.11 Frames

There is one common overall frame format for 802.11 networks but multiple types 
of frames. Not all the fields are present in every type of frame. Figure 3-18 shows 
the format of the common frame and a (maximal-size) data frame. 


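[Figure 3-18 layout. Physical-layer PDU: preamble, PLCP Header, MAC PDU (MPDU). MPDU fields,
in order: Frame Ctrl (2 bytes), Duration/ID (2), Address 1 (6), Address 2 (6), Address 3 (6),
Seq Ctrl (2), Address 4 (6), QoS Ctrl, HT Control, Frame Body, FCS (4). The fields from
Frame Ctrl through HT Control form the MAC Header; some fields are marked optional.]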

Figure 3-18 The 802.11 basic data frame format (as of [802.11n-2009]). The MPDU format resembles that of 
Ethernet but has additional fields depending on the type of DS being used among access points, 
whether the frame is headed to the DS or from it, and if frames are being aggregated. The QoS 
Control field is used for special performance features, and the HT Control field is used for control 
of 802.11n's "high-throughput" features. 


The frame shown in Figure 3-18 includes a preamble for synchronization,
which depends on the particular variant of 802.11 being used. Next, the Physical
Layer Convergence Procedure (PLCP) header provides information about the specific
physical layer in a somewhat PHY-independent way. The PLCP portion of the
frame is generally transmitted at a lower data rate than the rest of the frame. This
serves two purposes: to improve the probability of correct delivery (lower speeds
tend to have better error resistance) and to provide compatibility with and protection
from interference from legacy equipment that may operate in the same area at
slower rates. The MAC PDU (MPDU) corresponds to a frame similar to Ethernet,
but with some additional fields.

At the head of the MPDU is the Frame Control Word, which includes a 2-bit
Type field identifying the frame type. There are three types of frames: management
frames, control frames, and data frames. Each of these can have various subtypes,
depending on the type. The full table of types and subtypes is given in [802.11n-2009,
Table 7-1]. The contents of the remaining fields, if present, are determined by
the frame type, which we discuss individually.
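
To make the role of the Frame Control Word concrete, the following sketch decodes the Type field and a few flag bits from the first 2 bytes of an MPDU. The bit positions used here reflect our understanding of the standard's layout (protocol version in bits 0-1, type in bits 2-3, subtype in bits 4-7, with the field transmitted in little-endian byte order) and should be checked against [802.11n-2009]; the function and dictionary key names are purely illustrative.

```python
import struct

# Type values as we understand them from the standard; the value 3 is reserved.
FRAME_TYPES = {0: "management", 1: "control", 2: "data"}


def parse_frame_control(first_two_bytes: bytes) -> dict:
    """Decode selected subfields of the 16-bit 802.11 Frame Control Word."""
    (fc,) = struct.unpack("<H", first_two_bytes)  # little-endian 16-bit value
    return {
        "version": fc & 0x3,
        "type": FRAME_TYPES.get((fc >> 2) & 0x3, "reserved"),
        "subtype": (fc >> 4) & 0xF,
        "more_frag": bool(fc & (1 << 10)),   # further fragments follow
        "retry": bool(fc & (1 << 11)),       # frame is a retransmission
        "power_mgmt": bool(fc & (1 << 12)),  # sender is entering power save
        "more_data": bool(fc & (1 << 13)),   # AP holds buffered frames
    }


if __name__ == "__main__":
    print(parse_frame_control(b"\x80\x00"))  # a management frame (subtype 8)
    print(parse_frame_control(b"\x08\x08"))  # a data frame with Retry set
```

Only the first 2 bytes are examined; which of the remaining header fields are present depends on the type and subtype decoded here.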

3.5.1.1 Management Frames 

Management frames are used for creating, maintaining, and ending associations 
between stations and access points. They are also used to determine whether 
encryption is being used, what the name (SSID or ESSID) of the network is, what 
transmission rates are supported, and a common time base. These frames are used
to provide the information necessary when a Wi-Fi interface "scans" for nearby
access points.

Scanning is the procedure by which a station discovers available networks
and related configuration information. This involves switching to each available
frequency and passively listening for traffic to identify available access points. Stations
may also actively probe for networks by transmitting a particular management
frame ("probe request") while scanning. There are some limitations on such
probe requests to ensure that 802.11 traffic is not transmitted on a frequency that
is being used for non-802.11 purposes (e.g., medical services). Here is an example
of initiating a scan by hand on a Linux system:


Linux# iwlist wlan0 scan
wlan0     Scan completed :
          Cell 01 - Address: 00:02:6F:20:B5:84
                    ESSID:"Grizzly-5354-Aries-802.11b/g"
                    Mode:Master
                    Channel:4
                    Frequency:2.427 GHz (Channel 4)
                    Quality=5/100  Signal level=47/100
                    Encryption key:on
                    IE: WPA Version 1
                        Group Cipher : TKIP
                        Pairwise Ciphers (2) : CCMP TKIP
                        Authentication Suites (1) : PSK
                    Bit Rates:1 Mb/s; 2 Mb/s; 5.5 Mb/s; 11 Mb/s;
                              6 Mb/s; 12 Mb/s; 24 Mb/s; 36 Mb/s; 9 Mb/s;
                              18 Mb/s; 48 Mb/s; 54 Mb/s
                    Extra:tsf=0000009d832ff037

Here we see the result of a hand-initiated scan using wireless interface wlan0.
An AP with MAC address 00:02:6F:20:B5:84 is acting as a master (i.e., is acting
as an AP in infrastructure mode). It is broadcasting the ESSID
"Grizzly-5354-Aries-802.11b/g" on channel 4 (2.427GHz). (See Section 3.5.4 on channels
and frequencies for more details on channel selection.) The quality and signal
level give indications of how well the scanning station is receiving a signal from
the AP, although the meaning of these values varies among manufacturers. WPA
encryption is being used on this link (see Section 3.5.5), and bit rates from 1Mb/s
to 54Mb/s are available. The tsf (time sync function) value indicates the AP's
notion of time, which is used for synchronizing various features such as power-saving
mode (see Section 3.5.2).

When an AP broadcasts its SSID, any station may attempt to establish an
association with the AP. When an association is established, most Wi-Fi networks
today also set up the necessary configuration information to provide Internet
access to the station (see Chapter 6). However, an AP's operator may wish to control
which stations make use of the network. Some operators intentionally make
this more difficult by having the AP not broadcast its SSID, as a security measure.
This approach provides little security, as the SSID may be guessed. More robust
security is provided by link encryption and passwords, which we discuss in
Section 3.5.5.

3.5.1.2 Control Frames: RTS/CTS and ACKs 

Control frames are used to handle a form of flow control as well as acknowledgments
for frames. Flow control helps ensure that a receiver can slow down a
sender that is too fast. Acknowledgments help a sender know what frames have
been received correctly. These concepts also apply to TCP at the transport layer
(see Chapter 15). 802.11 networks support optional request-to-send (RTS)/clear-to-send
(CTS) moderation of transmission for flow control. When these are enabled,
prior to sending a data frame a station transmits an RTS frame, and when the
recipient is willing to receive additional traffic, it responds with a CTS. After the
RTS/CTS exchange, the station has a window of time (identified in the CTS frame)
to transmit data frames that are acknowledged when successfully received. Such
transmission coordination schemes are common in wireless networks and mimic
the flow control signaling that has been used on wired serial lines for years
(sometimes called hardware flow control).

The RTS/CTS exchange helps to avoid the hidden terminal problem by instructing
each station when it is permitted to transmit, so as to avoid simultaneous
transmissions from stations that cannot hear each other. Because RTS and CTS
frames are short, they do not use the channel for long. An AP generally initiates
an RTS/CTS exchange for a packet if the size of the packet is large enough. Typically,
an AP has a configuration option called the packet size threshold (or similar).
Frames larger than the threshold cause an RTS to be sent prior to transmission of
the data. Most vendors use a default setting for this value of approximately 500
bytes if RTS/CTS exchanges are desired. In Linux, the RTS/CTS threshold can be
set in the following way:


Linux# iwconfig wlan0 rts 250
wlan0     IEEE 802.11g  ESSID:"Grizzly-5354-Aries-802.11b/g"
          Mode:Managed
          Frequency:2.427 GHz
          Access Point: 00:02:6F:20:B5:84
          Bit Rate=24 Mb/s   Tx-Power=0 dBm
          Retry min limit:7   RTS thr=250 B   Fragment thr=2346 B
          Encryption key:xxxx- ... -xxxx [3]
          Link Quality=100/100  Signal level=46/100
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:0  Invalid misc:0   Missed beacon:0


The iwconfig command can be used to set many variables, including the RTS
and fragmentation thresholds (see Section 3.5.1.3). It can also be used to determine
statistics such as the number of frame errors due to wrong network ID (ESSID) or
wrong encryption key. It also gives the number of excessive retries (i.e., the number
of retransmission attempts), a rough indicator of the reliability of the link that
is popular for guiding routing decisions in wireless networks [ETX]. In WLANs
with limited coverage, where hidden terminal problems are unlikely to occur, it
may be preferable to disable RTS/CTS by adjusting the stations' RTS thresholds to
be a high value (1500 or larger). This avoids the overhead imposed by requiring
RTS/CTS exchanges for each packet.

In wired Ethernet networks, the absence of a collision indicates that a frame 
has been received correctly with high probability. In wireless networks, there is 
a wider range of reasons a frame may not be delivered correctly, such as insufficient
signal or interference. To help address this potential problem, 802.11 extends
the 802.3 retransmission scheme with a retransmission/acknowledgment (ACK) 
scheme. An acknowledgment is expected to be received within a certain amount 
of time for each unicast frame sent (802.11a/b/g) or each group of frames sent 
(802.11n or 802.11e with "block ACKs"). Multicast and broadcast frames do not 
have associated ACKs to avoid "ACK implosion" (see Chapter 9). Failure to receive
an ACK within the specified time results in retransmission of the frame(s). 

With retransmissions, it is possible to have duplicate frames formed within 
the network. The Retry bit field in the Frame Control Word is set when any frame 
represents a retransmission of a previously transmitted frame. A receiving station 
can use this to help eliminate duplicate frames. Stations are expected to keep a 
small cache of entries indicating addresses and sequence/fragment numbers seen 
recently. When a received frame matches an entry, the frame is discarded. 
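
One way to picture the duplicate-detection cache is as a small, bounded table keyed by sender address, sequence number, and fragment number. The sketch below is an assumption-laden illustration, not the standard's procedure: the class name, the cache capacity of 64 entries, and the choice to treat a frame as a duplicate only when its Retry bit is set and it matches a cached entry are all ours.

```python
from collections import OrderedDict


class DuplicateFilter:
    """Small cache of recently seen (address, sequence, fragment) tuples."""

    def __init__(self, capacity: int = 64):  # capacity is an arbitrary choice
        self.capacity = capacity
        self.seen = OrderedDict()  # keys ordered from oldest to newest

    def is_duplicate(self, src_addr: str, seq: int, frag: int, retry: bool) -> bool:
        """Record the frame and report whether it should be discarded."""
        key = (src_addr, seq, frag)
        duplicate = retry and key in self.seen
        self.seen[key] = None
        self.seen.move_to_end(key)          # mark as most recently seen
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)   # evict the oldest entry
        return duplicate


if __name__ == "__main__":
    f = DuplicateFilter()
    print(f.is_duplicate("00:02:6f:20:b5:84", seq=291, frag=0, retry=False))
    print(f.is_duplicate("00:02:6f:20:b5:84", seq=291, frag=0, retry=True))
```

Requiring the Retry bit before discarding is a conservative design choice: it avoids dropping a new frame that happens to reuse an old sequence number after the 12-bit counter wraps.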

The amount of time necessary to send a frame and receive an ACK for it 
relates to the distance of the link and the slot time (a basic unit of time related to 
the 802.11 MAC protocol; see Section 3.5.3). The time to wait for an ACK (as well as 
the slot time) can be configured in most systems, although the method for doing 
so varies. In most cases such as home or office use, the default values are adequate. 
When using Wi-Fi over long distances, these values may require adjusting (see, for
example, [MWLD]). 

3.5.1.3 Data Frames, Fragmentation, and Aggregation 

Most frames seen on a busy network are data frames, which do what one would 
expect—carry data. Typically, there is a one-to-one relationship between 802.11 
frames and the link-layer (LLC) frames made available to higher-layer protocols
such as IP. However, 802.11 supports frame fragmentation, which can divide
frames into multiple fragments. With the 802.11n specification, it also supports 
frame aggregation, which can be used to send multiple frames together with less 
overhead. 

When fragmentation is used, each fragment has its own MAC header and trailing
CRC and is handled independently of other fragments. For example, fragments
to different destinations can be interleaved. Fragmentation can help improve performance
when the channel has significant interference. Unless block ACKs are
used, each fragment is sent individually, producing one ACK per fragment by the 
receiver. Because fragments are smaller than full-size frames, if a retransmission 
needs to be invoked, a smaller amount of data will need to be repaired. 
Fragmentation is applied only to frames with a unicast (non-broadcast or
multicast) destination address. To enable this capability, the Sequence Control field
contains a fragment number (4 bits) and a sequence number (12 bits). If a frame is fragmented,
all fragments contain a common sequence number value, and each adjacent
fragment has a fragment number differing by 1. A total of 15 fragments for the
same frame is possible, given the 4-bit-wide field. The More Frag field in the Frame 
Control Word indicates that further fragments are yet to come. Terminal fragments 
have this bit set to 0. A destination defragments the original frame from fragments 
it receives by assembling the fragments in order based on fragment number order 
within the frame sequence number. Provided that all fragments constituting a 
sequence number have been received and the last fragment has a More Frag field of 
0, the frame is reconstructed and passed to higher-layer protocols for processing. 
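
The reassembly rule just described can be sketched in a few lines. We assume here that the fragment number occupies the low-order 4 bits of the 16-bit Sequence Control field and the sequence number the upper 12 bits, which matches our understanding of the standard; the function names and the tuple representation of a fragment are hypothetical.

```python
def split_seq_ctrl(seq_ctrl: int) -> tuple:
    """Return (sequence_number, fragment_number) from a 16-bit Sequence Control value."""
    return seq_ctrl >> 4, seq_ctrl & 0xF


def reassemble(fragments):
    """Rebuild one frame from (seq_ctrl, more_frag, payload) tuples.

    All tuples are assumed to come from one sender and share one sequence
    number. Returns the joined payload, or None if the final fragment
    (More Frag = 0) or any intermediate fragment has not yet arrived.
    """
    by_number = {}
    last = None
    for seq_ctrl, more_frag, payload in fragments:
        _, frag_no = split_seq_ctrl(seq_ctrl)
        by_number[frag_no] = payload
        if not more_frag:
            last = frag_no
    if last is None or any(n not in by_number for n in range(last + 1)):
        return None
    return b"".join(by_number[n] for n in range(last + 1))


if __name__ == "__main__":
    seq = 0x123
    frags = [((seq << 4) | 1, True, b"BBB"),
             ((seq << 4) | 0, True, b"AAA"),
             ((seq << 4) | 2, False, b"CC")]
    print(reassemble(frags))      # fragments may arrive in any order
    print(reassemble(frags[:2]))  # final fragment missing
```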

Fragmentation is not often used because it does require some tuning. If used 
without tuning, it can worsen performance slightly. When smaller frames are 
used, the chance of having a bit error (see the next paragraph) can be reduced. 
The fragment size can usually be set from 256 bytes to 2KB as a threshold (only 
those frames that exceed the threshold in size are fragmented). Many APs default 
to not using fragmentation by setting the threshold high (such as 2437 bytes on a 
Linksys-brand AP). 

The reason fragmentation can be useful is a fairly simple exercise in probability.
If the bit error rate (BER) is $P$, the probability of a bit being successfully
delivered is $(1 - P)$ and the probability that $N$ bits are successfully delivered is
$(1 - P)^N$. As $N$ grows, this value shrinks. Thus, if we can shorten a frame, we can
in principle improve its error-free delivery probability. Of course, if we divide a
frame of size $N$ bits into fragments of $K$ bits each, we have to send at least
$\lceil N/K \rceil$ fragments. As a concrete example, assume that we wish to send a
1500-byte (12,000-bit) frame. If we assume $P = 10^{-4}$ (a relatively high BER), the
probability of successful delivery without fragmentation would be
$(1 - 10^{-4})^{12{,}000} = .301$. So we have only about a
30% chance of such a frame being delivered without errors the first time, and on
average we would have to send the frame three or four times for it to be received
successfully.

If we use fragmentation for the same example and set the fragmentation threshold
to 500, we produce three fragments of about 4000 bits each. The probability of
one such fragment being delivered without error is about $(1 - 10^{-4})^{4000} = .670$. Thus,
each fragment has about a 67% chance of being delivered successfully. Of course,
we have to have three of them delivered successfully to reconstruct the whole
frame. The probabilities of 3, 2, 1, and 0 fragments being delivered successfully
are $(.67)^3 = .30$, $3(.67)^2(.33) = .44$, $3(.67)(.33)^2 = .22$, and $(.33)^3 = .04$, respectively.
So, although the chances that all three are delivered successfully without retries
are about the same as for the nonfragmented frame being delivered successfully,
the chances that two or three fragments are delivered successfully are fairly good.
If this should happen, at most a single fragment would have to be retransmitted,
which would take significantly less time (about a third) than sending the
original 1500-byte unfragmented frame. Of course, each fragment consumes some
overhead, so if the BER is effectively 0, fragmentation only decreases performance
by creating more frames to handle.
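
The arithmetic in this example is easy to reproduce. The short sketch below recomputes the delivery probabilities for the unfragmented 12,000-bit frame and for three 4000-bit fragments, assuming (as the text does) independent bit errors at a BER of 10^-4 and ignoring per-fragment header overhead. The function and variable names are ours.

```python
from math import comb


def p_success(ber: float, bits: int) -> float:
    """Probability that `bits` consecutive bits are all delivered without error."""
    return (1.0 - ber) ** bits


BER = 1e-4
whole = p_success(BER, 12_000)   # unfragmented 1500-byte frame
frag = p_success(BER, 4_000)     # one 500-byte fragment

# Binomial probabilities that exactly k of the 3 fragments arrive intact.
exactly_k = {k: comb(3, k) * frag**k * (1 - frag) ** (3 - k) for k in range(4)}

if __name__ == "__main__":
    print(f"whole frame: {whole:.3f}")
    print(f"one fragment: {frag:.3f}")
    for k in (3, 2, 1, 0):
        print(f"{k} of 3 fragments delivered: {exactly_k[k]:.2f}")
    print(f"expected transmissions of whole frame: {1 / whole:.2f}")
```

The last line corresponds to the "three or four times" figure given earlier: with a success probability of about .30 per attempt, the expected number of attempts is about 1/.30.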

One of the enhancements provided by 802.11n is the support of frame 
aggregation, in two forms. One form, called the aggregated MAC service data unit 
(A-MSDU), allows for multiple complete 802.3 (Ethernet) frames to be aggregated 
within an 802.11 frame. The other form, called the aggregated MAC protocol data unit 
(A-MPDU), allows multiple MPDUs with the same source, destination, and QoS 
settings to be aggregated by being sent in short succession. The two aggregation 
types are depicted in Figure 3-19.
Figure 3-19 Frame aggregation in 802.11n includes A-MSDU and A-MPDU. A-MSDU aggregates frames using
a single FCS. A-MPDU aggregation uses a 4-byte delimiter between each aggregated 802.11 frame. 
Each A-MPDU subframe has its own FCS and can be individually acknowledged using block 
ACKs and retransmitted if necessary. 


For a single aggregate, the A-MSDU approach is technically more efficient.
Each 802.3 header is ordinarily 14 bytes, which is relatively small compared to 
an 802.11 MAC header that could be as long as 36 bytes. Thus, with only a single 
802.11 MAC header for multiple 802.3 frames, a savings of up to 22 bytes per extra 
aggregated frame could be achieved. An A-MSDU may be up to 7935 bytes, which 
can hold over 100 small (e.g., 50-byte) packets, but only a few (5) larger (1500- 
byte) data packets. The A-MSDU is covered by a single FCS. This larger size of an
A-MSDU frame increases the chances it will be delivered with errors, and because
there is only a single FCS for the entire aggregate, the entire frame would have to
be retransmitted on error. 
A-MPDU aggregation is a different form of aggregation whereby multiple (up
to 64) 802.11 frames, each with its own 802.11 MAC header and FCS and up to 4095
bytes each, are sent together. A-MPDUs may carry up to 64KB of data—enough 
for more than 1000 small packets and about 40 larger (1.5KB) packets. Because 
each constituent frame (subframe) carries its own FCS, it is possible to selectively
retransmit only those subframes received with errors. This is made possible by the 
block acknowledgment facility in 802.11n (originating in 802.11e), which is a form of 
extended ACK that provides feedback to a transmitter indicating which particular 
A-MPDU subframes were delivered successfully. This capability is similar in purpose,
but not in its details, to the selective acknowledgments we will see in TCP
(see Chapter 14). So, although the type of aggregation offered by A-MSDUs may 
be more efficient for error-free networks carrying large numbers of small packets, 
in practice it may not perform as well as A-MPDU aggregation [S08]. 
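
The difference in retransmission cost between the two aggregation forms can be illustrated with a small model. This is a deliberate simplification: it counts only payload bytes, ignores all header, delimiter, and ACK overhead, and the function and parameter names are our own.

```python
def bytes_to_retransmit(subframe_sizes, delivered_ok, per_subframe_fcs: bool) -> int:
    """Payload bytes that must be resent after one transmission attempt.

    per_subframe_fcs=True models A-MPDU with block ACKs, where only the
    damaged subframes are resent. per_subframe_fcs=False models A-MSDU,
    where a single FCS covers the aggregate, so any error forces the
    whole aggregate to be resent.
    """
    if per_subframe_fcs:
        return sum(size for size, ok in zip(subframe_sizes, delivered_ok) if not ok)
    return 0 if all(delivered_ok) else sum(subframe_sizes)


if __name__ == "__main__":
    sizes = [1500] * 5                      # five large packets in one aggregate
    ok = [True, True, False, True, True]    # one subframe received with errors
    print("A-MPDU resend:", bytes_to_retransmit(sizes, ok, per_subframe_fcs=True))
    print("A-MSDU resend:", bytes_to_retransmit(sizes, ok, per_subframe_fcs=False))
```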

3.5.2 Power Save Mode and the Time Sync Function (TSF) 

The 802.11 specification provides a way for stations to enter a limited power state, 
called power save mode (PSM). PSM is designed to save power by allowing an STA's 
radio receive circuitry to be powered down some of the time. Without PSM, the 
receiver circuitry would always be running, draining power. When in PSM, an 
STA's outgoing frames have a bit set in the Frame Control Word. A cooperative 
AP noticing this bit being set buffers any frames for the station until the station 
requests them. APs ordinarily send out beacon frames (a type of management 
frame) indicating various things like SSID, channel, and authentication information.
When supporting stations that use PSM, APs can also indicate the presence
of buffered frames to a station by setting an indication in the Frame Control Word 
of frames it sends. When stations enter PSM, they do so until the next AP beacon 
time, when they wake up and determine if there are pending frames stored at the 
AP for them. 

PSM should be used with care and understanding. Although it may extend battery
life, the NIC is not the only module drawing power in most wireless devices.
Other parts of the system such as the screen and hard drive can be significant consumers
of power, so overall battery life may not be extended much. Furthermore,
using PSM can affect throughput performance significantly as idle periods are 
added between frame transmissions and time is spent switching modes [SHK07]. 

The ability to awaken an STA to check for pending frames at exactly the correct
time (i.e., when an AP is about to send a beacon frame) depends on a common
sense of time at the AP and the PSM stations it serves. Wi-Fi synchronizes time
using the time synchronization function (TSF). Each station maintains a 64-bit counter
reference time (in microseconds) that is synchronized with other stations in the
network. Synchronization is maintained to within 4μs plus the maximum propagation
delay of the PHY (for PHYs of rate 1Mb/s or more). This is accomplished
by having any station that receives a TSF update (basically, a copy of the 64-bit
counter sent from another station) check to see if the provided value is larger than
its own. If so, the receiving station updates its own notion of time to be the larger
value. This approach ensures that clocks always move forward, but it also raises
some concern that, given stations with slightly differing clock rates, the slower
ones will tend to be synced to the fastest one.
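
The update rule is simple enough to state in one line of code. The sketch below models only the comparison described above; it ignores counter wraparound and propagation-delay compensation, and the function name is hypothetical.

```python
def tsf_update(local_us: int, received_us: int) -> int:
    """Return the new local TSF value (microseconds) after hearing an update.

    The local timer adopts the received value only if that value is larger,
    so the timer never moves backward.
    """
    return max(local_us, received_us)


if __name__ == "__main__":
    local = 0x0000009D832FF000
    heard = 0x0000009D832FF037   # the tsf value from the scan shown earlier
    print(hex(tsf_update(local, heard)))          # adopts the larger value
    print(hex(tsf_update(heard, local)) == hex(heard))  # ignores a smaller one
```

Because only larger values are adopted, a group of stations applying this rule converges on the timer of whichever station's clock runs fastest, which is the concern raised above.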

With the incorporation of 802.11e (QoS) features into 802.11, the basic PSM of 
802.11 has been extended to include the ability to schedule periodic batch processing
of buffered frames. The frequency is expressed in terms of the number of beacon
frames. The capability, called automatic power save delivery (APSD), uses some
of the subfields of the QoS control word. APSD may be especially useful for small 
power-constrained devices, as they need not necessarily awaken at each beacon 
interval as they do in conventional 802.11 PSM. Instead, they may elect to power 
down their radio transceiver circuitry for longer periods of their own choosing. 
802.11n also extends the basic PSM by allowing an STA equipped with multiple 
radio circuits operating together (see MIMO, Section 3.5.4.2) to power down all but
one of the circuits until a frame is ready. This is called spatial multiplexing power 
save mode. The specification also includes an enhancement to APSD called Power 
Save Multi-Poll (PSMP) that provides a way to schedule transmissions of frames in 
both directions (e.g., to and from AP) at the same time. 

3.5.3 802.11 Media Access Control 

In wireless networks, it is much more challenging to detect a "collision" than in 
wired networks such as 802.3 LANs. In essence, the medium is effectively simplex,
and multiple simultaneous transmitters must be avoided by coordinating
transmissions in either a centralized or a distributed manner. The 802.11 standard
has three approaches to control sharing of the wireless medium, called the
point coordination function (PCF), the distributed coordination function (DCF), and
the hybrid coordination function (HCF). HCF was brought into the 802.11 specification
[802.11-2007] with the addition of QoS support in 802.11e and is also used by
802.11n. Implementation of the DCF is mandatory for any type of station or AP, but 
implementation of the PCF is optional and not widespread (so we shall not discuss 
it in detail). HCF is found in relatively new QoS-capable Wi-Fi equipment, such as 
802.11n APs and earlier APs that support 802.11e. We turn our attention to DCF for 
now and describe HCF in the context of QoS next. 

DCF is a form of CSMA/CA for contention-based access to the medium. It is 
used for both infrastructure and ad hoc operation. With CSMA/CA, stations listen 
to see if the medium is free and, if so, may have an opportunity to transmit. If not, 
they avoid sending for a random amount of time before checking again to see if 
the medium is free. This behavior is similar to how a station sensing a collision 
backs off when using CSMA/CD on wired LANs. Channel arbitration in 802.11 is 
based on CSMA/CA with enhancements to provide priority access to certain stations
or frame types.

802.11 carrier sense is performed in both a physical and a virtual way. Generally,
stations wait for a period of time when ready to send (called the distributed
inter-frame space or DIFS) to allow higher-priority stations to access the channel.
If the channel becomes busy during the DIFS period, a station starts the waiting
period again. When the medium appears idle, a would-be transmitter initiates
the collision avoidance/backoff procedure described in Section 3.5.3.3. This procedure
is also initiated after a successful (unsuccessful) transmission is indicated
by the receipt (lack of receipt) of an ACK. In the case of unsuccessful transmission,
the backoff procedure is initiated with different timing (using the extended
inter-frame space or EIFS). We now discuss the implementation of DCF in more detail,
including the virtual and physical carrier sense mechanisms.

3.5.3.1 Virtual Carrier Sense, RTS/CTS, and the Network Allocation Vector (NAV) 
In the 802.11 MAC protocol, a virtual carrier sense mechanism operates by observing
the Duration field present in each MAC frame. This is accomplished by a station
listening to traffic not destined for it. The Duration field is present in both
RTS and CTS frames optionally exchanged prior to transmission, as well as conventional
data frames, and provides an estimate of how long the medium will be
busy carrying the frame.

The transmitter sets the Duration field based on the frame length, transmit 
rate, and other PHY characteristics. Each station keeps a local counter 
called the Network Allocation Vector (NAV) that estimates how long the medium 
will be busy carrying the current frame, and consequently how long it will need to 
wait before attempting its next transmission. A station overhearing traffic with a 
Duration field greater than its NAV updates its NAV to the new value. Because the 
Duration field is present in both RTS and CTS frames, if used, any station in range 
of either the sender or the receiver is able to ascertain the Duration field value. The 
NAV is maintained in time units and decremented based on a local clock. The 
medium is considered busy when the local NAV is nonzero. It is reset to 0 upon 
receipt of an ACK. 
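
The NAV bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation of the standard: the class name, method names, and the use of microseconds as the time unit are assumptions made for the example.

```python
class Station:
    """Tracks the Network Allocation Vector (NAV) for one station (illustrative)."""

    def __init__(self):
        self.nav_us = 0  # remaining reserved time, in microseconds (assumed unit)

    def overhear(self, duration_us):
        # Only a Duration value greater than the current NAV replaces it.
        if duration_us > self.nav_us:
            self.nav_us = duration_us

    def tick(self, elapsed_us):
        # The NAV counts down against the local clock and never goes below zero.
        self.nav_us = max(0, self.nav_us - elapsed_us)

    def ack_received(self):
        # The text states that the NAV is reset to 0 upon receipt of an ACK.
        self.nav_us = 0

    def virtual_carrier_busy(self):
        # The medium is considered busy while the NAV is nonzero.
        return self.nav_us > 0
```

A station that overhears a Duration of 300 and then one of 120 keeps its NAV at 300, because only larger values replace the current one.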

3.5.3.2 Physical Carrier Sense (CCA) 

Each 802.11 PHY specification (e.g., for different frequencies and radio technology) 
is required to provide a function for assessing whether the channel is clear based 
upon energy and waveform recognition (usually recognition of a well-formed 
PLCP). This function is called clear channel assessment (CCA) and its implementa¬ 
tion is PHY-dependent. The CCA capability represents the physical carrier sense 
capability for the 802.11 MAC to understand whether the medium is currently 
busy. It is used in conjunction with the NAV to determine when a station must 
defer (wait) prior to transmission. 

3.5.3.3 DCF Collision Avoidance/Backoff Procedure 

Upon determining that the channel is likely to be free (i.e., because the NAV dura¬ 
tion has been met and CCA does not indicate a busy channel), a station defers 
access prior to transmission. Because many stations may have been waiting for 
the channel to become free, each station computes and waits for a backoff time prior 
to sending. The backoff time is equal to the product of a random number and the 
slot time (unless the station attempting to transmit already has a nonzero backoff 
time, in which case it is not recomputed). The slot time is PHY-dependent but is 
generally a few tens of microseconds. The random number is drawn from a uniform 
distribution over the interval [0, CW], where the contention window (CW) is 
an integer containing a number of time slots to wait, with limits aCWmin ≤ CW 
≤ aCWmax defined by the PHY. The set of CW values increases by powers of 2 
(minus 1) beginning with the PHY-specific constant aCWmin value and continuing 
up to and including the constant aCWmax value for each successive transmission 
attempt. This is similar in effect to Ethernet's backoff procedure initiated 
during a collision detection event. 
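
The growth of the contention window and the resulting backoff time can be illustrated with a short Python sketch. The function names are hypothetical, and the values used in the usage note (aCWmin = 15, aCWmax = 1023, 9-microsecond slots) are illustrative assumptions, since the text only says these constants are PHY-specific.

```python
import random


def contention_windows(cw_min, cw_max):
    """List the CW value used on each successive attempt.

    Each value is one less than a power of 2: the next CW is 2 * (CW + 1) - 1,
    capped at cw_max.
    """
    cw = cw_min
    values = [cw]
    while cw < cw_max:
        cw = min(2 * (cw + 1) - 1, cw_max)
        values.append(cw)
    return values


def backoff_time_us(cw, slot_time_us, rng=random):
    """Backoff time = (uniform random integer in [0, CW]) * slot time."""
    return rng.randint(0, cw) * slot_time_us
```

With the assumed constants, `contention_windows(15, 1023)` yields 15, 31, 63, 127, 255, 511, 1023, and a backoff drawn with CW = 31 and 9-microsecond slots falls between 0 and 279 microseconds.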

In a wireless environment collision detection is not practical because it is 
difficult for a transmitter and receiver to operate simultaneously in the same piece of 
equipment and hear any transmissions other than its own, so collision avoidance 
is used instead. In addition, ACKs are generated in response to unicast frames to 
determine whether a frame has been delivered successfully. A station receiving 
a correct frame begins transmitting an ACK after waiting a small period of time 
(called the Short Interframe Space or SIFS), without regard to the busy/idle state of 
the medium. This should not cause a problem because the SIFS value is always 
smaller than DIFS, so in effect stations generating ACKs get priority access to the 
channel to complete their transactions. The source station waits a certain amount 
of time without receiving an ACK frame before concluding that a transmission 
has failed. Upon failure, the backoff procedure discussed previously is initiated 
and the frame is retried. The same procedure is initiated if a CTS is not received in 
response to an earlier RTS within a certain (different) amount of time (a constant 
called CTStimeout). 

3.5.3.4 HCF and 802.11e/n QoS 

Clauses 5, 6, 7, and 9 of the 802.11 standard [802.11-2007] are based in part on the 
work of the 802.11e group within IEEE, and the terms 802.11e, Wi-Fi QoS, and 
WMM (for Wi-Fi Multimedia) are often used. They cover the QoS facility: changes 
to the 802.11 MAC-layer and system interfaces in support of multimedia applications 
such as voice over IP (VoIP) and streaming video. Whether the QoS facility 
is really necessary or not often depends on the congestion level of the network 
and the types of applications to be supported. If utilization of the network tends 
to be low, the QoS MAC support may be unnecessary, although some of the other 
802.11e capabilities may still be useful (e.g., block ACKs and APSD). In situations 
where utilization and congestion are high and there is a need to support a 
low-jitter delivery capability for services such as VoIP, QoS support may be desirable. 
These specifications are relatively new, so QoS-capable Wi-Fi equipment is likely 
to be more expensive and complex than non-QoS equipment. 

The QoS facility introduces new terminology such as QoS stations (QSTAs), 
QoS access points (QAPs), and the QoS BSS (QBSS, a BSS supporting QoS). In general, 
any of the devices supporting QoS capabilities also support conventional 
non-QoS operation. 802.11n "high-throughput" stations (called HT STAs) are 
also QSTAs. A new form of coordination function, the hybrid coordination function 
(HCF), supports both contention-based and controlled channel access, although 
the controlled channel variant is seldom used. Within the HCF, there are two specified 
channel access methods that can operate together: HCF-controlled channel 
access (HCCA) and the more popular enhanced DCF channel access (EDCA), 
corresponding to reservation-based and contention-based access, respectively. There 
is also some support for admission control, which may deny connectivity entirely 
under high load. 

EDCA builds upon the basic DCF access. With EDCA, there are eight user 
priorities (UPs) that are mapped to four access categories (ACs). The user priorities 
use the same structure as 802.1d priority tags and are numbered 1 through 7, with 
7 being the highest priority. (There is also a 0 priority between 2 and 3.) The four 
ACs are nominally intended for background, best-effort, video, and audio traffic. 
Priorities 1 and 2 are intended for the background AC, priorities 0 and 3 are for 
the best-effort AC, 4 and 5 are for the video AC, and 6 and 7 are for the voice AC. 
For each AC, a variant of DCF contends for channel access credits called transmit 
opportunities (TXOPs), using alternative MAC parameters that tend to favor the 
higher-priority traffic. In EDCA, many of the various MAC parameters from DCF 
(e.g., DIFS, aCWmin, aCWmax) become adjustable as configuration parameters. 
These values are communicated to QSTAs using management frames. 
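
The mapping from user priorities to access categories, including the unusual position of priority 0, can be written down as a small lookup table. The Python sketch below is only an illustration of the mapping given in the text; the names `ACCESS_CATEGORY`, `UP_ORDER`, and the helper functions are invented for the example.

```python
# User priority (UP) -> access category (AC), as listed in the text.
ACCESS_CATEGORY = {
    1: "background", 2: "background",
    0: "best-effort", 3: "best-effort",
    4: "video", 5: "video",
    6: "voice", 7: "voice",
}

# UPs from lowest to highest priority; 0 sits between 2 and 3.
UP_ORDER = [1, 2, 0, 3, 4, 5, 6, 7]


def access_category(user_priority):
    """Return the access category for a user priority in the range 0-7."""
    return ACCESS_CATEGORY[user_priority]


def higher_priority(a, b):
    """Return whichever of two user priorities ranks higher."""
    return a if UP_ORDER.index(a) >= UP_ORDER.index(b) else b
```

Note that numeric comparison alone would get the ordering wrong: priority 0 outranks priorities 1 and 2 even though it is numerically smaller.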

HCCA builds loosely upon PCF and uses polling-controlled channel access. 
It is designed for synchronous-style access control and takes precedence ahead of 
the contention-based access of EDCA. A hybrid coordinator (HC) is located within 
an AP and has priority to allocate channel accesses. Prior to transmission, a station 
can issue a traffic specification (TSPEC) for its traffic and use UP values between 8 
and 15. The HC can allocate reserved TXOPs to such requests to be used during 
short-duration controlled access phases of frame exchange that take place before 
EDCA-based frame transmission. The HC can also deny TXOPs to TSPECs based 
on admission control policies set by the network administrator. The HCF exploits 
the virtual carrier sense mechanism discussed earlier with DCF to keep 
contention-based stations from interfering with contention-free access. Note that a single 
network comprising QSTAs and conventional stations can have both HCF and 
DCF running simultaneously by alternating between the two, but ad hoc networks 
do not support the HC and thus do not handle TSPECs and do not perform 
admission control. Such networks might still run HCF, but TXOPs are gained through 
EDCA-based contention. 

3.5.4 Physical-Layer Details: Rates, Channels, and Frequencies 

The [802.11-2007] standard now includes the following earlier amendments: 
802.11a, 802.11b, 802.11d, 802.11g, 802.11h, 802.11i, 802.11j, and 802.11e. The 802.11n 
standard was adopted as an amendment to 802.11 in 2009 [802.11n-2009]. Most 
of these amendments provide additional modulation, coding, and operating 
frequencies for 802.11 networks, but 802.11n also adds multiple data streams and 
a method for aggregating multiple frames (see Section 3.5.1.3). We will avoid 
detailed discussion of the physical layer, but to appreciate the breadth of options, 
Table 3-2 includes those parts of the 802.11 standard that describe this layer in 
particular. 


Table 3-2 Parts of the 802.11 standard that describe the physical layer 

Standard       Speeds (Mb/s)              Frequency Range;               Channel Set
(Clause)                                  Modulation
-------------  -------------------------  -----------------------------  ---------------------------
802.11a        6, 9, 12, 18, 24, 36,      5.16-5.35 and                  34-165 (varies by country);
(Clause 17)    48, 54                     5.725-5.825GHz; OFDM           20MHz/10MHz/5MHz channel
                                                                         width options

802.11b        1, 2, 5.5, 11              2.401-2.495GHz; DSSS           1-14 (varies by country)
(Clause 18)

802.11g        1, 2, 5.5, 6, 9, 11, 12,   2.401-2.495GHz; OFDM           1-14 (varies by country)
(Clause 19)    18, 24, 36, 48, 54
               (plus 22, 33)

802.11n        6.5-600 with many          2.4 and 5GHz modes with        1-13 (2.4GHz band);
               options (up to 4           20MHz- or 40MHz-wide           36-196 (5GHz band)
               MIMO streams)              channels; OFDM                 (varies by country)

802.11y        (Same as 802.11-2007)      3.650-3.700GHz (licensed);     1-25, 36-64, 100-161
                                          OFDM                           (varies by country)

The first column gives the original standard name and its present location in 
[802.11-2007], plus details for the 802.11n and 802.11y amendments. It is important 
to note from this table that 802.11b/g operate in the 2.4GHz Industrial, Scientific, and 
Medical (ISM) band, 802.11a operates only in the higher 5GHz Unlicensed National 
Information Infrastructure (U-NII) band, and 802.11n can operate in both. The 
802.11y amendment provides for licensed use in the 3.65-3.70GHz band within 
the United States. An important practical consequence of the data in this table is 
that 802.11b/g equipment does not interoperate or interfere with 802.11a equipment, 
but 802.11n equipment may interfere with either if not deployed carefully. 

3.5.4.1 Channels and Frequencies 

Regulatory bodies (e.g., the Federal Communications Commission in the United 
States) divide the electromagnetic spectrum into frequency ranges allocated for 
various uses across the world. For each range and use, a license may or may not 
be required, depending on local policy. In 802.11, there are sets of channels that 
may be used in various ways at various power levels depending on the regulatory 
domain or country. Wi-Fi channels are numbered in 5MHz units starting at 
some base center frequency. For example, channel 36 with a base center frequency 
of 5.00GHz gives the frequency 5000 + 36 * 5 = 5180MHz, the center frequency of 
channel 36. Although channel center frequencies are 5MHz apart from each other, 
channels may be wider than 5MHz (up to 40MHz for 802.11n). Consequently, some 
channels within channel sets of the same band usually overlap. Practically speak¬ 
ing, this means that transmissions on one channel might interfere with transmis¬ 
sions on nearby channels. 
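
The channel numbering rule reduces to a single line of arithmetic. The Python sketch below restates it; the function name and parameter names are chosen for the example, and the 5000MHz base is the one used in the text for channel 36.

```python
CHANNEL_SPACING_MHZ = 5  # channels are numbered in 5MHz units


def center_frequency_mhz(channel, base_mhz):
    """Center frequency of a channel, given the band's base center frequency."""
    return base_mhz + CHANNEL_SPACING_MHZ * channel
```

For example, `center_frequency_mhz(36, 5000)` gives 5180, matching the worked example, and consecutive channel numbers always differ by 5MHz regardless of how wide the channels themselves are.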

Figure 3-20 presents the channel-to-frequency mapping for the 802.11b/g 
channels in the 2.4GHz ISM band. Each channel is 22MHz wide. Not all channels 
are available for legal use in every country. For example, channel 14 is authorized 
at present for use only in Japan, and channels 12 and 13 are authorized for use in 
Europe, while the United States permits channels 1 through 11 to be used. Other 
countries may be more restrictive (see Annex J of the 802.11 standard and amend¬ 
ments). Note that policies and licensing requirements may change over time. 

Figure 3-20 The 802.11b and 802.11g standards use a frequency band between about 2.4GHz and 2.5GHz. This 
band is divided into fourteen 22MHz-wide overlapping channels, of which some subset is gener¬ 
ally available for legal use depending on the country of operation. It is advisable to assign non¬ 
overlapping channels, such as 1, 6, and 11 in the United States, to multiple base stations operating 
in the same area. Only a single 40MHz 802.11n channel may be used in this band without overlap. 


As shown in Figure 3-20, the effect of overlapping channels is now clear. A 
transmitter on channel 1, for example, overlaps with channels 2, 3, 4, and 5 but 
not higher channels. This becomes important when selecting which channels 
to assign for use in environments where multiple access points are to be used 
and even more important when multiple access points serving multiple different 
networks in the same area are to be used. One common approach in the United 
States is to assign up to three APs in an area using nonoverlapping channels 1, 6, 
and 11, as channel 11 is the highest-frequency channel authorized for unlicensed 
use in the United States. In cases where other WLANs may be operating in the 
same bands, it is worth considering jointly planning channel settings with all the 
affected WLAN administrators. 
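
The overlap rule can be checked numerically. The following Python sketch assumes a base center frequency of 2407MHz for the 2.4GHz band (so that channel 1 is centered at 2412MHz); that base value is not stated in the text and is an assumption of the example, as are the function names. It considers only channels 1 through 11, which lie on the regular 5MHz grid.

```python
BASE_2_4GHZ_MHZ = 2407   # assumed base; puts channel 1 at 2412MHz
CHANNEL_WIDTH_MHZ = 22   # width of each 802.11b/g channel, from the text


def center_mhz(channel):
    """Center frequency of a 2.4GHz channel on the regular 5MHz grid."""
    return BASE_2_4GHZ_MHZ + 5 * channel


def overlaps(a, b, width_mhz=CHANNEL_WIDTH_MHZ):
    """Two equal-width channels overlap if their centers are closer than one width."""
    return abs(center_mhz(a) - center_mhz(b)) < width_mhz


def overlapping_channels(channel, candidates=range(1, 12)):
    """Channels among the candidates (default 1-11) that overlap the given one."""
    return [c for c in candidates if c != channel and overlaps(channel, c)]
```

Under these assumptions channel 1 overlaps channels 2 through 5 only, and channels 1, 6, and 11 are mutually nonoverlapping because their centers are 25MHz apart.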

As shown in Figure 3-21, 802.11a/n/y share a somewhat more complicated 
channel set but offer a larger number of nonoverlapping channels to use (i.e., 12 
unlicensed 20MHz channels in the United States). 


Figure 3-21 Many of the approved 802.11 channel numbers and center frequencies for 20MHz 
channels. The most common range for unlicensed use involves the U-NII bands, all above 
5GHz. The lower band is approved for use in most countries. The "Europe" band is 
approved for use in most European countries, and the high band is approved for use in 
the United States and China. Channels are typically 20MHz wide for 802.11a/y but may 
be 40MHz wide for 802.11n. Narrower channels and some channels available in Japan 
are also available (not shown). 


In Figure 3-21, the channels are numbered in 5MHz increments, but different 
channel widths are available: 5MHz, 10MHz, 20MHz, and 40MHz. The 40MHz 
channel width is an option with 802.11n (see Section 3.5.4.2), along with several 
proprietary Wi-Fi systems that aggregate two 20MHz channels (called channel 
bonding). 

For typical Wi-Fi networks, an AP has its operating channel assigned during 
installation, and client stations change channels in order to associate with the AP. 
When operating in ad hoc mode, there is no controlling AP, so a station is typically 
hand-configured with the operating channel. The sets of channels available and 
operating power may be constrained by the regulatory environment, the hard¬ 
ware capabilities, and possibly the supporting driver software. 

3.5.4.2 802.11 Higher Throughput/802.11n 

In late 2009, the IEEE standardized 802.11n [802.11n-2009] as an amendment to 
[802.11-2007]. It makes a number of important changes to 802.11. To support higher 
throughput, it incorporates support for multiple input, multiple output (MIMO) man¬ 
agement of multiple simultaneously operating data streams carried on multiple 
antennas, called spatial streams. Up to four such spatial streams are supported on a 
given channel. 802.11n channels may be 40MHz wide (using two adjacent 20MHz 
channels), twice as wide as conventional channels in 802.11a/b/g/y. Thus, there 
is an immediate possibility of having up to eight times the maximum data rate of 
802.11a/g (54Mb/s), for a total of 432Mb/s. However, 802.11n also improves the 
single-stream performance by using a more efficient modulation scheme (802.11n 
uses MIMO-orthogonal frequency division multiplexing (OFDM) with up to 52 
data subcarriers per 20MHz channel and 108 per 40MHz channel, instead of 48 in 
802.11a and 802.11g), plus a more efficient forward error-correcting code (rate 5/6 
instead of 3/4), bringing the per-stream performance to 65Mb/s (20MHz channel) 
or 135Mb/s (40MHz channel). By also reducing the guard interval (GI, a forced 
idle time between symbols) duration to 400ns from the legacy 800ns, the maxi¬ 
mum per-stream performance is raised to about 72.2Mb/s (20MHz channel) and 
150Mb/s (40MHz channel). With four spatial streams operating in concert per¬ 
fectly, this provides a maximum of about 600Mb/s. 
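
The rates quoted above can be reproduced from the number of data subcarriers, the bits carried per subcarrier, the code rate, and the symbol duration. The Python sketch below assumes a useful OFDM symbol time of 3200ns, to which the guard interval is added, and 6 bits per subcarrier for 64-QAM; neither value is given in the text, so treat both as assumptions of the example. The function name is hypothetical.

```python
from fractions import Fraction

USEFUL_SYMBOL_NS = 3200  # assumed symbol duration excluding the guard interval


def data_rate_mbps(subcarriers, bits_per_subcarrier, code_rate, guard_ns, streams=1):
    """Data rate in Mb/s for one or more spatial streams.

    bits per symbol = subcarriers * bits per subcarrier * code rate * streams;
    dividing by the symbol time in ns and multiplying by 1000 gives Mb/s.
    """
    bits_per_symbol = subcarriers * bits_per_subcarrier * code_rate * streams
    symbol_ns = USEFUL_SYMBOL_NS + guard_ns
    return float(bits_per_symbol * 1000 / symbol_ns)
```

With 64-QAM and a rate 5/6 code, 52 subcarriers and an 800ns GI give 65Mb/s, 108 subcarriers give 135Mb/s, and shortening the GI to 400ns with four streams on a 40MHz channel gives 600Mb/s. The same formula with 48 subcarriers and a rate 3/4 code gives the 54Mb/s of 802.11a/g.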

Some 77 combinations of modulation and coding options are supported 
by 802.11n, including 8 options for a single stream, 24 using the same or equal 
modulation (EQM) on all streams, and 43 using unequal modulation (UEQM) on 
multiple streams. Table 3-3 gives some of these combinations, indexed by the 
modulation and coding scheme (MCS) value and concentrating on the first 33 values 
(0-32). Higher values (33-76) include combinations for two spatial streams (values 
33-38), three spatial streams (39-52), and four spatial streams (53-76). MCS value 32 
is a special combination where the signals in the two halves of the 40MHz channel 
contain the same information. Each data rate column gives two values, one using 
the legacy 800ns GI and one giving the greater data rate available using the shorter 
400ns GI. The values 6Mb/s (MCS 32) and 600Mb/s (MCS 31) represent the smallest 
and largest throughput rates, respectively. 

Table 3-3 MCS values for 802.11n include combinations of equal and unequal modulation, different 
FEC coding rates, up to four spatial streams using 20MHz- or 40MHz-wide channels, and 
an 800ns or 400ns GI. The 77 combinations provide data rates from 6Mb/s to 600Mb/s. 

MCS    Modulation Type   FEC Code  Spatial  Rates (Mb/s)    Rates (Mb/s)
Value                    Rate      Streams  (20MHz)         (40MHz)
                                            [800/400ns]     [800/400ns]
-----  ----------------  --------  -------  --------------  ------------
0      BPSK              1/2       1        6.5/7.2         13.5/15
1      QPSK              1/2       1        13/14.4         27/30
2      QPSK              3/4       1        19.5/21.7       40.5/45
3      16-QAM            1/2       1        26/28.9         54/60
4      16-QAM            3/4       1        39/43.3         81/90
5      64-QAM            2/3       1        52/57.8         108/120
6      64-QAM            3/4       1        58.5/65         121.5/135
7      64-QAM            5/6       1        65/72.2         135/150
8      BPSK              1/2       2        13/14.4         27/30
...
15     64-QAM            5/6       2        130/144.4       270/300
16     BPSK              1/2       3        19.5/21.7       40.5/45
...
31     64-QAM            5/6       4        260/288.9       540/600
32     BPSK              1/2       1        N/A             6/6.7
...
76     64x3/16x1-QAM     3/4       4        214.5/238.3     445.5/495

Table 3-3 shows the various combinations of coding, including binary phase shift 
keying (BPSK), quadrature phase shift keying (QPSK), and various levels of quadrature 
amplitude modulation (16- and 64-QAM), available with 802.11n. These modulation 
schemes provide an increasing data rate for a given channel bandwidth. However, 
the more high-performance and complex a modulation scheme, the more vulnerable 
it tends to be to noise and interference. Forward error correction (FEC) includes 
a set of methods whereby redundant bits are introduced at the sender that can be 
used to detect and repair bit errors introduced during delivery. For FEC, the code 
rate is the ratio of the effective useful data rate to the rate imposed on the underlying 
communication channel. For example, a 1/2 code rate would deliver 1 useful 
bit for every 2 bits sent. 

802.11n may operate in one of three modes. In 802.11n-only environments, 
the optional so-called greenfield mode can be used; in this mode the PLCP contains 
special bit arrangements ("training sequences") known only to 802.11n equipment 
and does not interoperate with legacy equipment. To maintain compatibility, 
802.11n has two other interoperable modes. However, both of these impose a 
performance penalty on native 802.11n equipment. One mode, called non-HT mode, 
essentially disables all 802.11n features but remains compatible with legacy 
equipment. This is not a very interesting mode, so we shall not discuss it further. 
However, a required mode called HT-mixed mode supports both 802.11n and legacy 
operation, depending on which stations are communicating. The information 
required to convey an AP's 802.11n capability to HT STAs yet protect legacy STAs 
is provided in the PLCP, which is augmented to contain both HT and legacy 
information and is transmitted at a slower rate than in greenfield mode so that it 
can be processed by legacy equipment. HT protection also requires an HT AP to 
use self-directed CTS frames (or RTS/CTS frame exchanges) at the legacy rate to 
inform legacy stations when it will use shared channels. Even though RTS/CTS 
frames are short, the requirement to send them at the legacy rate (6Mb/s) can 
significantly reduce an 802.11n WLAN's performance. 

When deploying an 802.11n AP, care should be taken to set up appropriate 
channel assignments. When using 40MHz channels, 802.11n APs should be 
operated in the U-NII bands above 5GHz as there is simply not enough useful 
spectrum to use these wider channels in the 2.4GHz ISM band. An optional BSS 
feature called phased coexistence operation (PCO) allows an AP to periodically switch 
between 20MHz and 40MHz channel widths, which can provide better coexistence 
between 802.11n APs operating near legacy equipment at the cost of some 
throughput. Finally, it is worth mentioning that 802.11n APs generally 
require more power than conventional APs. This higher power level exceeds the 
basic 15W provided by 802.3af power-over-Ethernet (PoE) system wiring, meaning 
that PoE+ (802.3at, capable of 30W) should be used unless some other form of 
power such as a direct external power supply is available. 

3.5.5 Wi-Fi Security 

There has been considerable evolution in the security model for 802.11 networks. 
In its early days, 802.11 used an encryption method known as wired equivalent 
privacy (WEP). WEP was later shown to be so weak that some replacement was 
required. Industry responded with Wi-Fi protected access (WPA), which replaced 
the way keys are used with encrypted blocks (see Chapter 18 for the basics of 
cryptography). In WPA, a scheme called the Temporal Key Integrity Protocol (TKIP) 
ensures, among other things, that each frame is encrypted with a different encryp¬ 
tion key. It also includes a message integrity check, called Michael, that fixed one 
of the major weaknesses in WEP. WPA was created as a placeholder that could be 
used on fielded WEP-capable equipment by way of a firmware upgrade while the 
IEEE 802.11i standards group worked on a stronger standard that was ultimately 
absorbed into Clause 8 of [802.11-2007] and dubbed "WPA2" by industry. Both 
WEP and WPA use the RC4 encryption algorithm [S96]. WPA2 uses the Advanced 
Encryption Standard (AES) algorithm [AES01]. 

The encryption techniques we just discussed are aimed at providing privacy 
between the station and AP, assuming the station has legitimate authorization to 
be accessing the network. In WEP, and small-scale environments that use WPA 
or WPA2, authorization is typically implemented by pre-placing a shared key 
or password in each station as well as in the AP during configuration. A user 
knowing the key is assumed to have legitimate access to the network. These keys 
are also frequently used to initialize the encryption keys used to ensure privacy. 
Using such pre-shared keys (PSKs) has limitations. For example, an administrator 
may have considerable trouble in providing keys only to authorized users. If a 
user becomes de-authorized, the PSK has to be replaced and all legitimate users 
informed. This approach does not scale to environments with many users. As a 
result, WPA and later standards support a port-based network access control standard 
called 802.1X [802.1X-2010]. It provides a way to carry the Extensible Authentication 
Protocol (EAP) [RFC3748] in IEEE 802 LANs (called EAPOL), including 802.3 and 
802.11 [RFC4017]. EAP, in turn, can be used to carry many other standard and 
nonstandard authentication protocols. It can also be used to establish keys, including 
WEP keys. Details of these protocols are given in Chapter 18, but we shall also see 
the use of EAP when we discuss PPP in Section 3.6. 

With the completion of the IEEE 802.11i group's work, the RC4/TKIP combina¬ 
tion in WPA was extended with a new algorithm called CCMP as part of WPA2. 
CCMP is based on using the counter mode (CCM [RFC3610]) of the AES for 
confidentiality with cipher block chaining message authentication code (CBC-MAC; note the 
"other" use of the term MAC here) for authentication and integrity. All AES 
processing is performed using a 128-bit block size and 128-bit keys. CCMP and TKIP 
form the basis for a Wi-Fi security architecture named the Robust Security Network 
(RSN), which supports Robust Security Network Access (RSNA). Earlier methods, 
such as WEP, are called pre-RSNA methods. RSNA compliance requires support 
for CCMP (TKIP is optional), and 802.11n does away with TKIP entirely. Table 3-4 
provides a summary of this somewhat complicated situation. 

Table 3-4 Wi-Fi security has evolved from WEP, which was found to be insecure, to WPA, to the 
now-standard WPA2 collection of algorithms. 

Name/Standard    Cipher  Key Stream Management  Authentication
---------------  ------  ---------------------  -----------------
WEP (pre-RSNA)   RC4     (WEP)                  PSK, (802.1X/EAP)
WPA              RC4     TKIP                   PSK, 802.1X/EAP
WPA2/802.11(i)   CCMP    CCMP, (TKIP)           PSK, 802.1X/EAP


In all cases, both pre-shared keys as well as 802.1X can be used for authentica¬ 
tion and initial keying. The major attraction of using 802.1X/EAP is that a managed 
authentication server can be used to provide access control decisions on a per-user 
basis to an AP. For this reason, authentication using 802.1X is sometimes referred to 
as "Enterprise" (e.g., WPA-Enterprise). EAP itself can encapsulate various specific 
authentication protocols, which we discuss in more detail in Chapter 18. 

3.5.6 Wi-Fi Mesh (802.11s) 

The IEEE is working on the 802.11s standard, which covers Wi-Fi mesh operation. 
With mesh operation, wireless stations can act as data-forwarding agents (like 
APs). The standard is not yet complete as of writing (mid-2011). The draft version 
of 802.11s defines the Hybrid Wireless Routing Protocol (HWRP), based in part on the 
IETF standards for Ad-Hoc On-Demand Distance Vector (AODV) routing [RFC3561] 
and the Optimized Link State Routing (OLSR) protocol [RFC3626]. Mesh stations 
(mesh STAs) are a type of QoS STA and may participate in HWRP or other routing 
protocols, but compliant nodes must include an implementation of HWRP and the 
associated airtime link metric. Mesh nodes coordinate using EDCA or may use an 
optional coordinating function called mesh deterministic access. Mesh points (MPs) 
are those nodes that form mesh links with neighbors. Those that also include AP 
functionality are called mesh APs (MAPs). Conventional 802.11 stations can use 
either APs or MAPs to access the rest of the wireless LAN. 

The 802.11s draft specifies a new optional form of security for RSNA called 
Simultaneous Authentication of Equals (SAE) authentication [SAE]. This security 
protocol is a bit different from others because it does not require lockstep opera¬ 
tion between a specially designated initiator and responder. Instead, stations 
are treated as equals, and any station that first recognizes another may initiate a 
security exchange (or this may happen simultaneously as two stations initiate an 
association). 


3.6 Point-to-Point Protocol (PPP) 

PPP stands for the Point-to-Point Protocol [RFC1661][RFC1662][RFC2153]. It is a pop¬ 
ular method for carrying IP datagrams over serial links—from low-speed dial-up 
modems to high-speed optical links [RFC2615]. It is widely deployed by some DSL 
service providers, which also use it for assigning Internet system parameters (e.g., 
initial IP address and domain name server; see Chapter 6). 

PPP should be considered more of a collection of protocols than a single 
protocol. It supports a basic method to establish a link, called the Link Control 
Protocol (LCP), as well as a family of Network Control Protocols (NCPs), used to 
establish network-layer links for various kinds of protocols, including IPv4 and IPv6 
and possibly non-IP protocols, after LCP has established the basic link. A number 
of related standards cover control of compression and encryption for PPP, and a 
number of authentication methods can be employed when a link is brought up. 

3.6.1 Link Control Protocol (LCP) 

The LCP portion of PPP is used to establish and maintain a low-level two-party 
communication path over a point-to-point link. PPP's operation therefore need 
be concerned only with two ends of a single link; it does not need to handle the 
problem of mediating access to a shared resource like the MAC-layer protocols of 
Ethernet and Wi-Fi. 

PPP generally, and LCP more specifically, imposes minimal requirements on 
the underlying point-to-point link. The link must support bidirectional operation 
(LCP uses acknowledgments) and operate either asynchronously or synchro¬ 
nously. Typically, LCP establishes a link using a simple bit-level framing format 
based on the High-Level Data Link Control (HDLC) protocol. HDLC was already 
a well-established framing format by the time PPP was designed [ISO3309] 
[ISO4335]. IBM modified it to form Synchronous Data Link Control (SDLC), a 
protocol used as the link layer in its proprietary Systems Network Architecture (SNA) 
protocol suite. HDLC was also used as the basis for the LLC standard in 802.2 and 
ultimately for PPP as well. The format is shown in Figure 3-22. 


[Figure 3-22 frame layout, fields in transmission order: Flag (0x7E) | Addr (0xFF) | 
Control (0x03) | Protocol (2 bytes) | Data (PPP control or network-layer data) | 
Pad (0 if present) | FCS (2 or 4 bytes) | Flag (0x7E). A bracket labeled "FCS Coverage" 
marks the fields protected by the FCS.] 


Figure 3-22 The PPP basic frame format was borrowed from HDLC. It provides a protocol identifier, payload 
area, and 2- or 4-byte FCS. Other fields may or may not be present, depending on compression 
options. 


The PPP frame, in the common case when HDLC-like framing is used
as shown in Figure 3-22, is surrounded by two 1-byte Flag fields containing the
fixed value 0x7E. These fields are used by the two stations on the ends of the
point-to-point link for finding the beginning and end of the frame. A small problem
arises if the value 0x7E itself occurs inside the frame. This is handled in one of
two ways, depending on whether PPP is operating over an asynchronous or a synchronous
link. For asynchronous links, PPP uses character stuffing (also called byte
stuffing). If the flag character appears elsewhere in the frame, it is replaced with
the 2-byte sequence 0x7D5E (0x7D is known as the "PPP escape character"). If the
escape character itself appears in the frame, it is replaced with the 2-byte sequence
0x7D5D. Thus, the receiver replaces 0x7D5E with 0x7E and 0x7D5D with 0x7D
upon receipt. On synchronous links (e.g., T1 lines, T3 lines), PPP uses bit stuffing.
Noting that the flag character has the bit pattern 01111110 (a contiguous sequence
of six 1 bits), bit stuffing arranges for a 0 bit to be inserted after any contiguous
string of five 1 bits appearing in a place other than the flag character itself. Doing
so implies that bytes may be sent as more than 8 bits, but this is generally OK, as
low layers of the serial processing hardware are able to "unstuff" the bit stream,
restoring it to its prestuffed pattern.
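
The two stuffing schemes just described can be sketched in a few lines of Python. This is
a minimal illustration, not production framing code. The function names are made up for
the example, and the bit-stuffing routine works on a string of '0'/'1' characters purely
for readability (real hardware operates on the serial bit stream).

```python
FLAG = 0x7E      # frame delimiter
ESCAPE = 0x7D    # PPP escape character


def byte_stuff(payload: bytes) -> bytes:
    """Escape flag and escape bytes, as done on asynchronous links."""
    out = bytearray()
    for b in payload:
        if b in (FLAG, ESCAPE):
            # 0x7E becomes 7D 5E and 0x7D becomes 7D 5D (value XOR 0x20).
            out += bytes([ESCAPE, b ^ 0x20])
        else:
            out.append(b)
    return bytes(out)


def byte_unstuff(stuffed: bytes) -> bytes:
    """Reverse byte_stuff on the receiving side."""
    out = bytearray()
    it = iter(stuffed)
    for b in it:
        if b == ESCAPE:
            nxt = next(it, None)
            if nxt is None:
                raise ValueError("escape character at end of data")
            out.append(nxt ^ 0x20)
        else:
            out.append(b)
    return bytes(out)


def bit_stuff(bits: str) -> str:
    """Insert a 0 after every run of five consecutive 1 bits."""
    out = []
    run = 0
    for bit in bits:
        out.append(bit)
        run = run + 1 if bit == "1" else 0
        if run == 5:
            out.append("0")
            run = 0
    return "".join(out)
```

For example, a payload byte equal to the flag pattern 01111110 leaves bit_stuff as the
nine bits 011111010, so six consecutive 1 bits never appear inside a frame.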

After the first Flag field, PPP adopts the HDLC Address (Addr) and Control
fields. In HDLC, the Address field would specify which station is being addressed,
but because PPP is concerned only with a single destination, this field is always
defined to have the value 0xFF (all stations). The Control field in HDLC is used to
indicate frame sequencing and retransmission behavior. As these link-layer reliability
functions are not ordinarily implemented by PPP, the Control field is set
to the fixed value 0x03. Because both the Address and Control fields are fixed constants
in PPP, they are often omitted during transmission with an option called
Address and Control Field Compression (ACFC), which essentially eliminates the two
fields.


Note 

There has been considerable debate over the years as to how much reliability
link-layer networks should provide, if any. With Ethernet, up to 16 retransmission
attempts are made before giving up. Typically, PPP is configured to do no
retransmission, although there do exist specifications for adding retransmission
[RFC1663]. The trade-off can be subtle and is dependent on the types of traffic to
be carried. A detailed discussion of the considerations is contained in [RFC3366].


The Protocol field of the PPP frame indicates the type of data being carried.
Many different types of protocols can be carried in a PPP frame. The official list
and the assigned number used in the Protocol field are given by the "Point-to-Point
Protocol Field Assignments" document [PPPn]. In conforming to the HDLC specification,
protocol numbers are assigned such that the least significant bit of the
most significant byte equals 0 and the least significant bit of the least significant
byte equals 1. Values in the (hexadecimal) range 0x0000-0x3FFF identify network-layer
protocols, and values in the 0x8000-0xBFFF range identify data belonging to
an associated NCP. Protocol values in the range 0x4000-0x7FFF are used for "low-volume"
protocols with no associated NCP. Protocol values in the range 0xC000-0xEFFF
identify control protocols such as LCP. In some circumstances the Protocol
field can be compressed to a single byte, if the Protocol Field Compression (PFC)
option is negotiated successfully during link establishment. This is applicable to
protocols with protocol numbers in the range 0x0000-0x00FF, which includes most
of the popular network-layer protocols. Note, however, that LCP packets always
use the 2-byte uncompressed format.
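
The numbering rules above lend themselves to a small sketch. The helper names below are
hypothetical and the range labels simply restate the ranges given in the text. Values
above 0xEFFF are reported as unclassified because the passage does not describe them.

```python
def classify_protocol(value: int) -> str:
    """Map a 16-bit PPP Protocol field value to the class named in the text."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("Protocol field is 16 bits")
    if value <= 0x3FFF:
        return "network-layer protocol"
    if value <= 0x7FFF:
        return "low-volume protocol (no NCP)"
    if value <= 0xBFFF:
        return "NCP"
    if value <= 0xEFFF:
        return "control protocol"
    return "unclassified here"


def is_well_formed(value: int) -> bool:
    """HDLC rule: LSB of the high byte is 0 and LSB of the low byte is 1."""
    high, low = value >> 8, value & 0xFF
    return (high & 1) == 0 and (low & 1) == 1


def encode_protocol(value: int, pfc_negotiated: bool) -> bytes:
    """Emit 1 byte when PFC applies (values 0x0000-0x00FF), else 2 bytes."""
    if pfc_negotiated and value <= 0x00FF:
        return bytes([value])
    return value.to_bytes(2, "big")
```

With this sketch, IPv4 (0x0021) encodes to the single byte 0x21 when PFC has been
negotiated, while LCP (0xC021) always occupies 2 bytes because it lies outside the
compressible range.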

The final portion of the PPP frame contains a 16-bit FCS (a CRC16, with generator
polynomial 10001000000100001) covering the entire frame except the FCS field
itself and Flag bytes. Note that the FCS value covers the frame before any byte or
bit stuffing has been performed. With an LCP option (see Section 3.6.1.2), the CRC
can be extended from 16 to 32 bits. This case uses the same CRC32 polynomial
mentioned previously for Ethernet.
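
A bit-at-a-time sketch of the 16-bit FCS follows. The generator polynomial is the one
given above (x^16 + x^12 + x^5 + 1). The remaining details (initial value 0xFFFF,
least-significant-bit-first processing, final one's complement, and the 0xF0B8 "good FCS"
constant) are taken from the HDLC-like framing specification for PPP (RFC 1662) rather
than from this passage, so treat them as assumptions of the example. Table-driven
implementations are more common in practice.

```python
POLY_REFLECTED = 0x8408   # bit-reversed form of 0x1021 (x^16 + x^12 + x^5 + 1)
GOOD_FCS = 0xF0B8         # register value after a frame with a valid FCS


def fcs16_register(data: bytes, fcs: int = 0xFFFF) -> int:
    """Run the CRC shift register over data and return its raw contents."""
    for byte in data:
        fcs ^= byte
        for _ in range(8):
            if fcs & 1:
                fcs = (fcs >> 1) ^ POLY_REFLECTED
            else:
                fcs >>= 1
    return fcs


def fcs16(data: bytes) -> int:
    """FCS value to transmit: the complemented register."""
    return fcs16_register(data) ^ 0xFFFF


def append_fcs(frame: bytes) -> bytes:
    """Append the FCS, least significant byte first."""
    return frame + fcs16(frame).to_bytes(2, "little")


def fcs_ok(frame_with_fcs: bytes) -> bool:
    """A receiver runs the register over data plus FCS and checks the residue."""
    return fcs16_register(frame_with_fcs) == GOOD_FCS
```

The input to these functions is the unstuffed frame content between the Flag bytes
(Addr through Pad), matching the coverage described above.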

3.6.1.1 LCP Operation 

LCP has a simple encapsulation beyond the basic PPP packet. It is illustrated in
Figure 3-23.


PPP packet fields, in transmission order:

Field      Value         Size (bytes)
Flag       0x7E          1
Addr       0xFF          1
Control    0x03          1
Protocol   0xC021        2
Code                     1
Ident                    1
Length                   2
LCP Data                 variable
Pad        if present    variable
FCS                      2 or 4
Flag       0x7E          1

The LCP packet (Code, Ident, Length, LCP Data) is carried inside the PPP packet.


Figure 3-23 The LCP packet is a fairly general format capable of identifying the type of encapsulated data and 
its length. LCP frames are used primarily in establishing a PPP link, but this basic format also 
forms the basis of many of the various network control protocols. 


The PPP Protocol field value for LCP is always 0xC021, which is not eliminated
using PFC, so as to minimize ambiguity. The Ident field is a sequence number
provided by the sender of LCP request frames and is incremented for each subsequent
message. When forming a reply (ACK, NACK, or REJECT response), this
field is constructed by copying the value included in the request to the response
packet. In this fashion, the requesting side can identify replies to the appropriate
request by matching identifiers. The Code field gives the type of operation being
either requested or responded to: configure-request (0x01), configure-ACK (0x02),
configure-NACK (0x03), configure-REJECT (0x04), terminate-request (0x05),
terminate-ACK (0x06), code-REJECT (0x07), protocol-REJECT (0x08), echo-request
(0x09), echo-reply (0x0A), discard-request (0x0B), identification (0x0C), and
time-remaining (0x0D). Generally, ACK messages indicate acceptance of a set of options,
and NACK messages indicate a partial rejection with suggested alternatives. A
REJECT message rejects one or more options entirely. A rejected code indicates
that one of the field values contained in a previous packet is unknown. The Length
field gives the length of the LCP packet in bytes and is not permitted to exceed the
link's maximum received unit (MRU), a form of maximum advised frame limit we
shall discuss later. Note that the Length field is part of the LCP protocol; the PPP
protocol in general does not provide such a field.
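
As a rough illustration of the header just described, the sketch below decodes the Code,
Ident, and Length fields of an LCP packet (the bytes following the PPP Protocol field).
The function name and the returned tuple are choices made for this example. It assumes,
following the LCP specification, that Length counts the 4-byte header plus the data, so
anything after Length bytes is treated as padding.

```python
import struct

LCP_CODES = {
    0x01: "configure-request", 0x02: "configure-ACK",
    0x03: "configure-NACK",    0x04: "configure-REJECT",
    0x05: "terminate-request", 0x06: "terminate-ACK",
    0x07: "code-REJECT",       0x08: "protocol-REJECT",
    0x09: "echo-request",      0x0A: "echo-reply",
    0x0B: "discard-request",   0x0C: "identification",
    0x0D: "time-remaining",
}


def parse_lcp(packet: bytes):
    """Return (code name, ident, data) for an LCP packet."""
    if len(packet) < 4:
        raise ValueError("LCP header is 4 bytes")
    # Code (1 byte), Ident (1 byte), Length (2 bytes, network byte order).
    code, ident, length = struct.unpack("!BBH", packet[:4])
    if length < 4 or length > len(packet):
        raise ValueError("bad Length field")
    data = packet[4:length]          # bytes past Length are padding
    return LCP_CODES.get(code, "unknown"), ident, data


def matches(request_ident: int, reply: bytes) -> bool:
    """A reply belongs to a request when the Ident values are equal."""
    return parse_lcp(reply)[1] == request_ident
```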

The main job of LCP is to bring up a point-to-point link to a minimal level.
Configure messages cause each end of the link to start the basic configuration procedure
and establish agreed-upon options. Termination messages are used to clear
a link when complete. LCP also provides some additional features mentioned previously.
LCP may exchange Echo Request/Reply messages at any time while a link is active
in order to verify operation of the peer. The Discard Request message can
be used for performance measurement; it instructs the peer to discard the packet
without responding. The Identification and Time-Remaining messages are used for
administrative purposes: to know the type of the peer system and to indicate the
amount of time allowed for the link to remain established (e.g., for administrative
or security reasons).

Historically, one common problem with point-to-point links occurs if a remote
station is in loopback mode or is said to be "looped." Telephone company wide area
data circuits are sometimes put into loopback mode for testing: data sent at one
side is simply returned from the other. Although this may be useful for line testing,
it is not at all helpful for data communication, so LCP includes ways to send
a magic number (an arbitrary number selected by the sender) to see if it is immediately
returned in the same message type. If so, the line is detected as being looped,
and maintenance is likely required.

To get a better feeling for how PPP links are established and options are negotiated,
Figure 3-24 illustrates a simplified packet exchange timeline as well as a
simplified state machine (implemented at both ends of the link).

The link is considered to be established once the underlying protocol layer
has indicated that an association has become active (e.g., carrier detected for
modems). Link quality testing, which involves an exchange of link quality reports
and acknowledgments (see Section 3.6.1.2), may also be accomplished during this
period. If the link requires authentication, which is common, for example, when
dialing in to an ISP, a number of additional exchanges may be required to establish
the authenticity of one or both parties attached to the link. The link is terminated
once the underlying protocol or hardware has indicated that the association
has stopped (e.g., carrier lost) or after having sent a link termination request and
received a termination ACK from the peer.

3.6.1.2 LCP Options 

[Figure 3-24 timeline columns: Peer 1 | Messages Exchanged | Peer 2]

Figure 3-24 LCP is used to establish a PPP link and agree upon options by each peer. The typical
exchange involves a pair of configure requests and ACKs that contain the option list,
an authentication exchange, data exchange (not pictured), and a termination exchange.
Because PPP is such a general-purpose protocol with many parts, many other types of
operations may occur between the establishment of a link and its termination.

Several options can be negotiated by LCP as it establishes a link for use by one or
more NCPs. We shall discuss two of the more common ones. The Asynchronous
Control Character Map (ACCM) or simply "asyncmap" option defines which control
characters (i.e., ASCII characters in the range 0x00-0x1F) need to be "escaped" as
PPP operates. Escaping a character means that the true value of the character is
not sent, but instead the PPP escape character (0x7D) is stuffed in front of a value
formed by XORing the original control character with the value 0x20. For example,
the XOFF character (0x13) would be sent as (0x7D33). ACCM is used in cases
where control characters may affect the operation of the underlying hardware.
For example, if software flow control using XON/XOFF characters is enabled and
the XOFF character is passed through the link unescaped, the data transfer ceases
until the hardware observes an XON character. The asyncmap option is generally
specified as a 32-bit hexadecimal number where a 1 bit in the nth least significant
bit position indicates that the control character with value n should be escaped.
Thus, the asyncmap 0xffffffff would escape all control characters, 0x00000000
would escape none of them, and 0x000A0000 would escape XON (value 0x11) and
XOFF (value 0x13). Although the value 0xffffffff is the specified default, many
links today can operate safely with the asyncmap set to 0x00000000.
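
The asyncmap encoding can be checked with a short sketch. The function names are
invented for this example; the bit numbering (bit n, counting from 0 at the least
significant end, stands for control character n) follows the description above.

```python
PPP_ESCAPE = 0x7D


def asyncmap_for(control_chars) -> int:
    """Build a 32-bit asyncmap with one bit set per control character."""
    amap = 0
    for c in control_chars:
        if not 0x00 <= c <= 0x1F:
            raise ValueError("asyncmap covers only 0x00-0x1F")
        amap |= 1 << c
    return amap


def must_escape(byte: int, asyncmap: int) -> bool:
    """True when the byte is a control character selected by the map."""
    return byte <= 0x1F and bool((asyncmap >> byte) & 1)


def escape(byte: int) -> bytes:
    """Escape sequence: 0x7D followed by the byte XOR 0x20."""
    return bytes([PPP_ESCAPE, byte ^ 0x20])
```

Here asyncmap_for([0x11, 0x13]) yields 0x000A0000, the XON/XOFF map quoted in the text,
and escape(0x13) yields the bytes 7D 33.
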

Because PPP lacks a Length field and serial lines do not typically provide framing,
no immediate hard limit is set on the length of a PPP frame, in theory. In practice,
some maximum frame size is typically given by specifying the MRU. When
a host specifies an MRU option (type 0x01), the peer is requested to never send
frames longer than the value provided in the MRU option. The MRU value is the
length of the data field in bytes; it does not count the various other PPP overhead
fields (i.e., Protocol, FCS, Flag fields). Typical values are 1500 or 1492 but may be as
large as 65,535. A minimum of 1280 is required for IPv6 operations. The standard
requires PPP implementations to accept frames as large as 1500 bytes, so the MRU
serves more as advice to the peer in choosing the packet size than as a hard limit
on the size. When small packets are interleaved with larger packets on the same
PPP link, the larger packets may use most of the bandwidth of a low-bandwidth
link, to the detriment of the small packets. This can lead to jitter (delay variance),
negatively affecting interactive applications such as remote login and VoIP. Configuring
a smaller MRU (or MTU) can help mitigate this issue at the cost of higher
overhead.
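
A back-of-the-envelope calculation shows why frame size matters on slow links. The
56,000 bits/s rate and the 296-byte alternative below are illustrative assumptions, not
figures from the text, and the calculation ignores framing overhead and stuffing.

```python
def serialization_delay(frame_bytes: int, link_bps: float) -> float:
    """Seconds needed to clock a frame of the given size onto the link."""
    return frame_bytes * 8 / link_bps


# Assumed link rate for illustration only.
LINK_BPS = 56_000

large = serialization_delay(1500, LINK_BPS)   # about 0.214 s
small = serialization_delay(296, LINK_BPS)    # about 0.042 s
```

A small interactive packet queued behind one 1500-byte frame waits roughly a fifth of a
second on such a link; with the smaller frame size the wait drops to a few hundredths
of a second, at the price of more per-frame overhead.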

PPP supports a mechanism to exchange link quality reporting information.
During option negotiation, a configuration message including a request for a particular
quality protocol may be included. Sixteen bits of the option are reserved to
specify the particular protocol, but the most common is a PPP standard involving
Link Quality Reports (LQRs) [RFC1989], using the value 0xC025 in the PPP Protocol
field. If this is enabled, the peer is asked to provide LQRs at some periodic rate.
The maximum time between LQRs requested is encoded as a 32-bit number present
in the configuration option and expressed in units of 1/100 second. Peers may generate
LQRs more frequently than requested. LQRs include the following information:
a magic number, the number of packets and bytes sent and received, the number
of incoming packets with errors and the number of discarded packets, and the
total number of LQRs exchanged. A typical implementation allows the user to
configure how often LQRs are requested from the peer. Some also provide a way
to terminate the link if the quality history fails to meet some configured threshold.
LQRs may be requested after the PPP link has reached the Establish state. Each
LQR is given a sequence number, so it is possible to determine trends over time,
even in the face of reordering of LQRs.

Many PPP implementations support a callback capability. In a typical callback
setup, a PPP dial-up callback client calls in to a PPP callback server, authentication
information is provided, and the server disconnects and calls the client back.
This may be useful in situations where call toll charges are asymmetric or for
some level of security. The protocol used to negotiate callback is an LCP option
with value 0x0D [RFC1570]. If agreed upon, the Callback Control Protocol (CBCP)
completes the negotiation.

Some compression and encryption algorithms used with PPP require a certain
minimum number of bytes, called the block size, when operating. When data is
not otherwise long enough, padding may be added to cause the length to become
an even multiple of the block size. If present, padding is included beyond the data
area and prior to the PPP FCS field. A padding method known as self-describing
padding [RFC1570] uses nonzero pad values: each pad byte
gets the value of its offset in the pad area. Thus, the first byte of pad would have
the value 0x01, and the final byte contains the number of pad bytes that were
added. At most, 255 bytes of padding are supported. The self-describing padding
option (type 10) indicates to a peer the ability to understand this form of padding
and includes the maximum pad value (MPV), which is the largest pad value allowed
for this association. Recall that the basic PPP frame lacks an explicit Length field,
so a receiver can use self-describing padding to determine how many pad bytes
should be trimmed from the received data area.
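
Self-describing padding can be sketched as follows. The function names are hypothetical.
One detail goes beyond the passage and is an assumption of this sketch: when unpadded
data already ends in a byte between 1 and the MPV, the sender adds a full block of
padding so the receiver cannot mistake that data byte for a pad count. The receiver
here also verifies the whole 1, 2, ..., n sequence before trimming, which is a
conservative choice rather than a stated requirement.

```python
def add_pad(data: bytes, block_size: int, mpv: int) -> bytes:
    """Pad data to a multiple of block_size with bytes 1, 2, ..., n."""
    n = (-len(data)) % block_size
    if n == 0 and data and 1 <= data[-1] <= mpv:
        n = block_size            # avoid ambiguity with a real pad count
    if n > mpv:
        raise ValueError("needed padding exceeds the negotiated MPV")
    return data + bytes(range(1, n + 1))


def strip_pad(padded: bytes, mpv: int) -> bytes:
    """Remove self-describing padding if the data area ends with it."""
    if not padded:
        return padded
    n = padded[-1]                # final byte holds the pad length
    if 1 <= n <= mpv and n <= len(padded):
        if padded[-n:] == bytes(range(1, n + 1)):
            return padded[:-n]
    return padded
```

For a block size of 8 and an MPV of 8, five bytes of data gain the pad bytes 01 02 03,
and the receiver recovers the original five bytes by reading the final byte.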

To lessen the impact of the fixed costs of sending a header on every frame, a
method has been introduced to multiplex multiple distinct payloads of potentially
different protocols into the same PPP frame, an approach called PPPMux [RFC3153].
The primary PPP header Protocol field is set to multiplexed frame (0x0059), and then
each payload block is inserted into the frame. This is accomplished by introducing
a 1- to 4-byte subframe header in front of each payload block. It includes 1 bit
(called PFF) indicating whether a Protocol field is included in the subframe header
and another 1-bit field (called LXT) indicating whether the following Length field
is 1 or 2 bytes. Beyond this, if present, is the 1- or 2-byte Protocol ID using the same
values and same compression approach as with the outer PPP header. A 0 value for
PFF (meaning no PID field is present) is possible when the subframe matches the
default PID established when the configuration state is set up using the PPPMux
Control Protocol (PPPMuxCP).

The PPP frame format in Figure 3-22 indicates that the ordinary PPP/HDLC
FCS can be either 16 or 32 bits. While the default is 16, 32-bit FCS values can be
enabled with the 32-bit FCS option. Other LCP options include the use of PFC and
ACFC, and selection of an authentication algorithm.

Internationalization [RFC2484] provides a way to convey the language and 
character set to be used. The character set is one of the standard values from the 
"charset registry" [IANA-CHARSET], and the language value is chosen from the 
list in [RFC5646][RFC4647]. 

3.6.2 Multilink PPP (MP) 

A special option to PPP called multilink PPP (MP) [RFC1990] can be used to
aggregate multiple point-to-point links to act as one. This idea is similar to link
aggregation, discussed earlier, and has been used for aggregating multiple
circuit-switched channels together (e.g., ISDN B channels). MP includes a special
LCP option to indicate multilink support as well as a negotiation protocol to fragment
and recombine fragmented PPP frames across multiple links. An aggregated
link, called a bundle, operates as a complete virtual link and can contain its own
configuration information. The bundle comprises a number of member links. Each
member link may also have its own set of options.

The obvious method to implement MP would be to simply alternate packets
across the member links. This approach, called the bank teller's algorithm, may
lead to reordering of packets, which can have undesirable performance impacts on
other protocols. (Although TCP/IP, for example, can function properly with reordered
packets, it may not function as well as it could without reordering.) Instead,
MP places a 2- or 4-byte sequencing header in each packet, and the remote MP
receiver is tasked with reconstructing the proper order. The data frame appears as
shown in Figure 3-25.

Figure 3-25 An MP fragment contains a sequencing header that allows the remote end of a multilink bundle
to reorder fragments. Two formats of this header are supported: a short header (2 bytes) and a
long header (4 bytes).


In Figure 3-25 we see an MP fragment with the begin (B) and end (E) fragment 
bit fields and Sequence Number field. Note that there is both a long format, in which 
4 bytes are used for the fragmentation information, and a short format, in which 
only 2 bytes are used. The format being used is selected during option negotiation 
using the LCP short sequence number option (type 18). If a frame is not fragmented 
but is carried in this format, both the B and E bits are set, indicating that the fragment
is the first and last (i.e., it is the whole frame). Otherwise, the first fragment
has the BE bit combination set to 10 and the final fragment has the BE bits set to 
01, and all fragments in between have them set to 00. The sequence number then 
gives the packet number offset relative to the first fragment. 
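
The sequencing header can be decoded with a few bit operations. The exact bit positions
used below come from the multilink specification [RFC1990] rather than from this passage,
so they are stated here as assumptions of the sketch: B and E are the two most
significant bits, followed by a 24-bit sequence number in the long format or a 12-bit
sequence number in the short format. The function name is made up for the example.

```python
def parse_mp_header(data: bytes, short: bool = False):
    """Return (begin, end, sequence number, fragment payload)."""
    size = 2 if short else 4
    if len(data) < size:
        raise ValueError("truncated MP header")
    word = int.from_bytes(data[:size], "big")
    top = size * 8 - 1                       # position of the B bit
    begin = bool((word >> top) & 1)
    end = bool((word >> (top - 1)) & 1)
    seq_bits = 12 if short else 24
    seq = word & ((1 << seq_bits) - 1)
    return begin, end, seq, data[size:]


def is_whole_frame(begin: bool, end: bool) -> bool:
    """Both bits set means the fragment carries an unfragmented frame."""
    return begin and end
```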

Use of MP is requested by including an LCP option called the multilink maximum
received reconstructed unit (MRRU, type 17) that can act as a sort of larger MRU
applying to the bundle. Frames larger than any of the member link MRUs may
still be permitted across the MP link, up to the limit advertised in this value.

Because an MP bundle may span multiple member links, a method is needed
to identify member links as belonging to the same bundle. Member links in the
same bundle are identified by the LCP endpoint discriminator option (type 19). The
endpoint discriminator could be a phone number, a number derived from an IP or
MAC address, or some administrative string. Other than being common to each
member link, there are few restrictions on the form of this option.

The basic method of establishing MP as defined in [RFC1990] expects that
member links are going to be used symmetrically: about the same number of
fragments will be allocated to each of a fixed number of links. In order to achieve
more sophisticated allocations than this, the Bandwidth Allocation Protocol (BAP)
and Bandwidth Allocation Control Protocol (BACP) are specified in [RFC2125]. BAP
can be used to dynamically add or remove links from a bundle, and BACP can be
used to exchange information regarding how links should be added or removed
using BAP. This capability can be used to help implement bandwidth on demand
(BOD). In networks where some fixed resource needs to be allocated in order to
meet an application's need for bandwidth (e.g., by dialing some number of telephone
connections), BOD typically involves monitoring traffic and creating new
connections when usage is high and shutting down connections when usage is
low. This is useful, for example, in cases where some monetary charge is associated
with the number of connections being used.

BAP/BACP makes use of a new link discriminator LCP option (LCP option type
23). This option contains a 16-bit numeric value that is required to be different for
each member link of a bundle. It is used by BAP to identify which links are to be
added or removed. BACP is negotiated once per bundle during the network phase
of a PPP link. Its main purpose is to identify a favored peer. That is, if more than one
bundle is being set up simultaneously among multiple peers, the favored peer is
preferentially allocated member links.

BAP includes three packet types: request, response, and indication. Requests 
are to add a link to a bundle or to request the peer to delete a link from a bundle. 
Indications convey the results of attempted additions back to the original requester 
and are acknowledged. Responses are either ACKs or NACKs for these requests. 
More details can be found in [RFC2125]. 

3.6.3 Compression Control Protocol (CCP) 

Historically, PPP has been the protocol of choice when using relatively slow dial-up
modems. As a consequence, a number of methods have been developed to
compress data sent over PPP links. This type of compression is distinct both from
the types of compression supported in modem hardware (e.g., V.42bis, V.44) and
also from protocol header compression, which we discuss later. Today, several compression
options are available. To choose among them for each direction on a PPP
link, LCP can negotiate an option to enable the Compression Control Protocol (CCP)
[RFC1962]. CCP acts like an NCP (see Section 3.6.5) but handles the details of configuring
compression once the compression option is indicated in the LCP link
establishment exchange.

In behaving like an NCP, CCP can be negotiated only once the link has entered
the Network state. It uses the same packet exchange procedures and formats as
LCP, except the Protocol field is set to 0x80FD, there are some special options, and
in addition to the common Code field values (1-7) two new operations are defined:
reset-request (0x0e) and reset-ACK (0x0f). If an error is detected in a compressed
frame, a reset request can be used to cause the peer to reset compression state
(e.g., dictionaries, state variables, state machines, etc.). After resetting, the peer
responds with a reset-ACK.

One or more compressed packets may be carried within the information portion
of a PPP frame (i.e., the portion including the LCP data and possibly pad
portions). Compressed frames carry the Protocol field value of 0x00FD, but the
mechanism used to indicate the presence of multiple compressed datagrams is
dependent on the particular compression algorithm used (see Section 3.6.6). When
used in conjunction with MP, CCP may be used either on the bundle or on some
combination of the member links. If used only on member links, the Protocol field
is set to 0x00FB (individual link compressed datagram).

CCP can enable one of about a dozen compression algorithms [PPPn]. Most
of the algorithms are not official standards-track IETF documents, although they
may be described in informational RFCs (e.g., [RFC1977] describes the BSD compression
scheme, and [RFC2118] describes the Microsoft Point-to-Point Compression
Protocol (MPPC)). If compression is being used, PPP frames are reconstructed
before further processing, so higher-layer PPP operations are not generally concerned
with the details of the compressed frames.

3.6.4 PPP Authentication 

Before a PPP link becomes operational in the Network state, it is often necessary to
establish the identity of the peer(s) of the link using some authentication (identity
verification) mechanism. The basic PPP specification has a default of no authentication,
so the authentication exchange of Figure 3-24 would not be used in such
cases. More often, however, some form of authentication is required, and a number
of protocols have evolved over the years to deal with this situation. In this
chapter we discuss them only from a high-level point of view and leave the details
for the chapter on security (Chapter 18). Other than no authentication, the simplest
and least secure authentication scheme is called the Password Authentication
Protocol (PAP). This protocol is very simple: one peer requests the other to send a
password, and the password is so provided. As the password is sent unencrypted
over the PPP link, any eavesdropper on the line can simply capture the password
and use it later. Because of this significant vulnerability, PAP is not recommended
for authentication. PAP packets are encoded as LCP packets with the Protocol field
value set to 0xC023.

A somewhat more secure approach to authentication is provided by the Challenge-Handshake
Authentication Protocol (CHAP) [RFC1994]. Using CHAP, a random
value is sent from one peer (called the authenticator) to the other. A response is
formed by using a special one-way (i.e., not easily invertible) function to combine
the random value with a shared secret key (usually derived from a password)
to produce a number that is sent in response. Upon receiving this response, the
authenticator can determine with a very high degree of confidence that its peer
possesses the correct secret key. This protocol never sends the key or password
over the link in a clear (unencrypted) form, so any eavesdropper is unable to learn
the secret. Because a different random value is used each time, the result of the
function changes for each challenge/response, so the values an eavesdropper may
be able to capture cannot be reused (played back) to impersonate the peer. However,
CHAP is vulnerable to a "man in the middle" form of attack (see Chapter 18).
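
The challenge/response idea can be sketched as below. The passage does not name the
one-way function. This example assumes the MD5-based calculation given in [RFC1994],
where the response is the hash of the packet identifier, the shared secret, and the
challenge, concatenated in that order. The function names and sample values are
illustrative only.

```python
import hashlib
import hmac


def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    """Peer side: hash of identifier, then secret, then challenge."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()


def chap_verify(identifier: int, secret: bytes,
                challenge: bytes, response: bytes) -> bool:
    """Authenticator side: recompute the hash and compare."""
    expected = chap_response(identifier, secret, challenge)
    return hmac.compare_digest(expected, response)


# Illustrative values; a real authenticator would draw each challenge
# from a random source such as os.urandom(16).
challenge_1 = bytes(range(16))
challenge_2 = bytes(range(16, 32))
resp_1 = chap_response(1, b"shared-secret", challenge_1)
resp_2 = chap_response(2, b"shared-secret", challenge_2)
```

Because resp_1 and resp_2 differ, a captured response is of no use against a later
challenge, which is the replay protection described above. The secret itself never
crosses the link.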

EAP [RFC3748] is an authentication framework available for many different
network types. It also supports many (about 40) different authentication methods,
ranging from simple passwords such as PAP and CHAP to more elaborate types
of authentication (e.g., smart cards, biometrics). EAP defines a message format
for carrying a variety of specific types of authentication formats, but additional
specifications are needed to define how EAP messages are carried over particular
types of links.

When EAP is used with PPP, the basic authentication method discussed so
far is altered. Instead of negotiating a specific authentication method early in the
link establishment (at LCP link establishment), the authentication operation may
be postponed until the Auth state (just before the Network state). This allows for
a greater richness in the types of information that can be used to influence access
control decisions by remote access servers (RASs). When there is a standard protocol
for carrying a variety of authentication mechanisms, a network access server may
not need to process the contents of EAP messages at all but can instead depend on
some other infrastructure authentication server (e.g., a RADIUS server [RFC2865])
to determine access control decisions. This is currently the design of choice for
enterprise networks and ISPs.

3.6.5 Network Control Protocols (NCPs) 

Although many different NCPs can be used on a PPP link (even simultaneously),
we shall focus on the NCPs supporting IPv4 and IPv6. For IPv4, the NCP is called
the IP Control Protocol (IPCP) [RFC1332]. For IPv6, the NCP is IPV6CP [RFC5072].
Once LCP has completed its link establishment and authentication, each end of the
link is in the Network state and may proceed to negotiate a network-layer association
using zero or more NCPs (one, such as IPCP, is typical).

IPCP, the standard NCP for IPv4, can be used to establish IPv4 connectivity over
a link and configure Van Jacobson header compression (VJ compression) [RFC1144].
IPCP packets may be exchanged after the PPP state machine has reached the Network
state. IPCP packets use the same packet exchange mechanism and packet
format as LCP, except the Protocol field is set to 0x8021, and the Code field is limited
to the range 0-7. These values of the Code field correspond to the message types:
vendor-specific (see [RFC2153]), configure-request, configure-ACK, configure-NACK,
configure-REJECT, terminate-request, terminate-ACK, and code-REJECT. IPCP can negotiate
a number of options, including an IP compression protocol (2), the IPv4 address
(3), and Mobile IPv4 [RFC2290] (4). Other options are available for learning the
location of primary and secondary domain name servers (see Chapter 11).

IPV6CP uses the same packet exchange and format as LCP, except it has two
different options: interface-identifier and IPv6-compression-protocol. The interface
identifier option is used to convey a 64-bit IID value (see Chapter 2) used as
the basis for forming a link-local IPv6 address. Because it is used only on the local
link, it does not require global uniqueness. This is accomplished using a standard
link-local prefix for the higher-order bits of the IPv6 address and allowing the
lower-order bits to be a function of the interface identifier. This mimics IPv6
autoconfiguration (see Chapter 6).
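
Forming the link-local address from a negotiated interface identifier amounts to
combining a fixed prefix with the 64-bit IID. The sketch below assumes the usual
link-local prefix fe80::/64 and uses the IID directly as the low-order 64 bits. The
function name and the sample IID are invented for the example.

```python
import ipaddress

LINK_LOCAL_PREFIX = int(ipaddress.IPv6Address("fe80::"))


def link_local_from_iid(iid: int) -> ipaddress.IPv6Address:
    """Place a 64-bit interface identifier under the link-local prefix."""
    if not 0 <= iid < (1 << 64):
        raise ValueError("interface identifier must fit in 64 bits")
    return ipaddress.IPv6Address(LINK_LOCAL_PREFIX | iid)


addr = link_local_from_iid(0x021122FFFE334455)
```

The resulting address is valid only on the PPP link itself, which is why the IID needs
to be unique only between the two peers.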

3.6.6 Header Compression 

PPP dial-up lines have historically been comparatively slow (54,000 bits/s or less),
and many small packets are often used with TCP/IP (e.g., for TCP's acknowledgments;
see Chapter 15). Most of these packets contain a TCP and IP header that
changes little from one packet to another on the same TCP connection. Other
higher-layer protocols behave similarly. Thus, it is useful to have a way of compressing
the headers of these higher-layer protocols (or eliminating them) so that
fewer bytes need to be carried over relatively slow point-to-point links. The methods
employed to compress or eliminate headers have evolved over time. We discuss
them in chronological order, beginning with VJ compression, mentioned earlier.

In VJ compression, portions of the higher-layer (TCP and IP) headers are
replaced with a small, 1-byte connection identifier. [RFC1144] discusses the origin
of this approach, using an older point-to-point protocol called CSLIP (Compressed
Serial Line IP). A typical IPv4 header is 20 bytes, and a TCP header without options
is another 20. Together, a common combined TCP/IPv4 header is thus 40 bytes,
and many of the fields do not change from packet to packet. Furthermore, many
of the fields that do change from packet to packet change only slightly or in a
limited way. When the nonchanging values are sent over a link once (or a small
number of times) and kept in a table, a small index can be used as a replacement
for the constants in subsequent packets. The limited changing values are then
encoded differentially (i.e., only the amount of change is sent). As a result, the
entire 40-byte header can usually be compressed to an effective 3 or 4 bytes. This
can significantly improve TCP/IP performance over slow links.
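
The table-plus-differences idea can be shown with a toy model. This is not the wire
format of [RFC1144]. Headers are modeled as dictionaries of integer fields, and the
function names, field names, and message tuples are all invented for the illustration.

```python
def compress(table: dict, conn_id: int, header: dict):
    """Send the full header once, then only the fields that changed."""
    prev = table.get(conn_id)
    table[conn_id] = dict(header)
    if prev is None or prev.keys() != header.keys():
        return ("full", conn_id, dict(header))
    # Differential encoding: carry only the amount of change per field.
    delta = {k: header[k] - prev[k] for k in header if header[k] != prev[k]}
    return ("delta", conn_id, delta)


def decompress(table: dict, message):
    """Rebuild the header from the stored copy plus the differences."""
    kind, conn_id, body = message
    if kind == "full":
        header = dict(body)
    else:
        header = dict(table[conn_id])
        for field, change in body.items():
            header[field] += change
    table[conn_id] = header
    return header


sender_table, receiver_table = {}, {}
h1 = {"src_port": 1025, "dst_port": 80, "seq": 1000, "ack": 500}
h2 = {"src_port": 1025, "dst_port": 80, "seq": 1512, "ack": 500}
m1 = compress(sender_table, 1, h1)
m2 = compress(sender_table, 1, h2)
```

The second message carries only the connection index and a sequence number change of
512. Both ends must keep their tables in step, so a damaged or lost message corrupts
every later header until a full header is sent again.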

The next step in the evolution of header compression is simply called IP header
compression [RFC2507][RFC3544]. It provides a way to compress the headers of
multiple packets using either TCP or UDP transport-layer protocols and either IPv4
or IPv6 network-layer protocols. The techniques are a logical extension and generalization
of the VJ compression technique that applies to more protocols, and to
links other than PPP links. [RFC2507] points out the necessity of some strong error
detection mechanism in the underlying link layer because erroneous packets can
be constructed at the egress of a link if compressed header values are damaged in
transit. This is important to recognize when header compression is used on links
that may not have as strong an FCS computation as PPP.

The most recent step in the evolution of header compression is known as
Robust Header Compression (ROHC) [RFC5225]. It further generalizes IP header
compression to cover more transport protocols and allows more than one form
of header compression to operate simultaneously. Like the IP header compression
mentioned previously, it can be used over various types of links, including PPP.

3.6.7 Example 

We now look at the debugging output of a PPP server interacting with a client
over a dial-in modem. The dialing-in client is an IPv6-capable Microsoft Windows
Vista machine, and the server is Linux. The Vista machine is configured to negotiate
multilink capability even on single links (Properties | Options | PPP Settings),
for demonstration purposes, and the server is configured to require an encryption
protocol negotiated using CCP (see MPPE in the following listing):


data dev=ttyS0, pid=28280, caller='none', conn='38400',
     name='',cmd='/usr/sbin/pppd', user='/AutoPPP/'
pppd 2.4.4 started by a_ppp, uid 0
using channel 54
Using interface ppp0
ppp0 <--> /dev/ttyS0
sent [LCP ConfReq id=0x1 <asyncmap 0x0> <auth eap>
      <magic 0xa5ccc449><pcomp> <accomp>]
rcvd [LCP ConfNak id=0x1 <auth chap MS-v2>]
sent [LCP ConfReq id=0x2 <asyncmap 0x0> <auth chap MS-v2>
      <magic 0xa5ccc449><pcomp> <accomp>]
rcvd [LCP ConfAck id=0x2 <asyncmap 0x0> <auth chap MS-v2>
      <magic 0xa5ccc449><pcomp> <accomp>]
rcvd [LCP ConfReq id=0x2 <asyncmap 0x0> <magic 0xa531e06>
      <pcomp> <accomp><callback CBCP> <mrru 1614>
      <endpoint [local:12.92.67.ef.2f.fe.44.6e.84.f8.
      c9.3f.5f.8c.5c.41.00.00.00.00]>]
sent [LCP ConfRej id=0x2 <callback CBCP> <mrru 1614>]
rcvd [LCP ConfReq id=0x3 <asyncmap 0x0> <magic 0xa531e06>
      <pcomp> <accomp>
      <endpoint [local:12.92.67.ef.2f.fe.44.6e.84.f8.
      c9.3f.5f.8c.5c.41.00.00.00.00]>]
sent [LCP ConfAck id=0x3 <asyncmap 0x0> <magic 0xa531e06>
      <pcomp> <accomp>
      <endpoint [local:12.92.67.ef.2f.fe.44.6e.84.f8.
      c9.3f.5f.8c.5c.41.00.00.00.00]>]
sent [CHAP Challenge id=0x1a <4d53c52b8e7dcfe7a9ea438b2b4daf55>,
      name = "dialer"]
rcvd [LCP Ident id=0x4 magic=0xa531e06 "MSRASV5.20"]
rcvd [LCP Ident id=0x5 magic=0xa531e06 "MSRAS-0-VISTA"]
rcvd [CHAP Response id=0x1a
      <4b5dc95ed4e1788b959025de0233d4fc0000000
      00000000033a555d2a77bd1fa692f2a0af707cd
      4f0c0072c379c82e0f00>,
      name = "dialer"]
sent [CHAP Success id=0x1a
      "S=7E0B6B513215C87520BEF6725EF8A9945C28E918 M=Access granted"]
sent [CCP ConfReq id=0x1 <mppe +H -M +S +L -D -C>]
rcvd [IPV6CP ConfReq id=0x6 <addr fe80::0000:0000:dead:beef>]
sent [IPV6CP TermAck id=0x6]
rcvd [CCP ConfReq id=0x7 <mppe -H -M -S -L -D +C>]
sent [CCP ConfNak id=0x7 <mppe +H -M +S +L -D -C>]
rcvd [IPCP ConfReq id=0x8 <compress VJ 0f 01> <addr 0.0.0.0>
      <ms-dns1 0.0.0.0> <ms-wins 0.0.0.0> <ms-dns3 0.0.0.0>
      <ms-wins 0.0.0.0>]
sent [IPCP TermAck id=0x8]
rcvd [CCP ConfNak id=0x1 <mppe -H -M +S -L -D -C>]
sent [CCP ConfReq id=0x2 <mppe -H -M +S -L -D -C>]
rcvd [CCP ConfReq id=0x9 <mppe -H -M +S -L -D -C>]
sent [CCP ConfAck id=0x9 <mppe -H -M +S -L -D -C>]
rcvd [CCP ConfAck id=0x2 <mppe -H -M +S -L -D -C>]
MPPE 128-bit stateful compression enabled
sent [IPCP ConfReq id=0x1 <compress VJ 0f 01> <addr 192.168.0.1>]
sent [IPV6CP ConfReq id=0x1 <addr fe80::0206:5bff:fedd:c5c3>]
rcvd [IPCP ConfAck id=0x1 <compress VJ 0f 01> <addr 192.168.0.1>]
rcvd [IPV6CP ConfAck id=0x1 <addr fe80::0206:5bff:fedd:c5c3>]
rcvd [IPCP ConfReq id=0xa <compress VJ 0f 01>
      <addr 0.0.0.0> <ms-dns1 0.0.0.0>
      <ms-wins 0.0.0.0> <ms-dns3 0.0.0.0> <ms-wins 0.0.0.0>]
sent [IPCP ConfRej id=0xa <ms-wins 0.0.0.0> <ms-wins 0.0.0.0>]
rcvd [IPV6CP ConfReq id=0xb <addr fe80::0000:0000:dead:beef>]
sent [IPV6CP ConfAck id=0xb <addr fe80::0000:0000:dead:beef>]
rcvd [IPCP ConfAck id=0x1 <compress VJ 0f 01> <addr 192.168.0.1>]
rcvd [IPV6CP ConfAck id=0x1 <addr fe80::0206:5bff:fedd:c5c3>]
local LL address fe80::0206:5bff:fedd:c5c3
remote LL address fe80::0000:0000:dead:beef
rcvd [IPCP ConfReq id=0xc <compress VJ 0f 01>
      <addr 0.0.0.0> <ms-dns1 0.0.0.0> <ms-dns3 0.0.0.0>]
sent [IPCP ConfNak id=0xc <addr 192.168.0.2> <ms-dns1 192.168.0.1>
      <ms-dns3 192.168.0.1>]
sent [IPCP ConfAck id=0xd <compress VJ 0f 01> <addr 192.168.0.2>
      <ms-dns1 192.168.0.1> <ms-dns3 192.168.0.1>]
local IP address 192.168.0.1
remote IP address 192.168.0.2
... data ...

Here we can see a somewhat involved PPP exchange, as viewed from the
server. The PPP server process creates a (virtual) network interface called ppp0,
which is awaiting an incoming connection on the dial-up modem attached to
serial port ttyS0. Once the incoming connection arrives, the server requests an
asyncmap of 0x0, EAP authentication, PFC, and ACFC. The client refuses EAP
authentication and instead suggests MS-CHAP-v2 (ConfNak) [RFC2759]. The
server then tries again, this time using MS-CHAP-v2, which is then accepted and
acknowledged (ConfAck). Next, the incoming request includes CBCP; an MRRU
of 1614 bytes, which is associated with MP support; and an endpoint ID. The server
rejects the request for CBCP and multilink operation (ConfRej). The endpoint
discriminator is once again sent by the client, this time without the MRRU, and is
accepted and acknowledged. Next, the server sends a CHAP challenge with the
name dialer. Before a response to the challenge arrives, two incoming identity
messages arrive, indicating that the peer is identified by the strings MSRASV5.20
and MSRAS-0-VISTA. Finally, the CHAP response arrives and is validated as correct,
and an acknowledgment indicates that access is granted. PPP then moves on
to the Network state.

Once in the Network state, the CCP, IPCP, and IPV6CP NCPs are exchanged.
CCP attempts to negotiate Microsoft Point-to-Point Encryption (MPPE) [RFC3078].
MPPE is somewhat of an anomaly, as it is really an encryption protocol, and rather
than compressing the packet it actually expands it by 4 bytes. It does, however,
provide a relatively simple means of establishing encryption early in the negotiation
process. The options +H -M +S +L -D -C indicate whether MPPE stateless
operation is desired (H), what cryptographic key strength is available (secure, S;
medium, M; or low, L), an obsolete D bit, and whether a separate, proprietary
compression protocol called MPPC [RFC2118] is desired (C). Eventually the two peers
agree on stateful mode using strong 128-bit keying (-H, +S). Note that during the
middle of this negotiation, the client attempts to send an IPCP request, but the
server responds with an unsolicited TermAck (a message defined within LCP
that IPCP adopts). This is used to indicate to the peer that the server is "in need of
renegotiation" [RFC1661].
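
The option strings in the log are easy to read mechanically. The following Python sketch is only a reading aid for the notation printed in the listing; the function names parse_mppe_options and describe are hypothetical and are not part of pppd. It turns a string such as +H -M +S +L -D -C into a table of flags and summarizes the mode and key strengths using the meanings given above.

```python
def parse_mppe_options(text):
    """Turn an option string such as '+H -M +S' into {letter: bool}."""
    flags = {}
    for token in text.split():
        sign, letter = token[0], token[1:]
        if sign not in "+-" or len(letter) != 1:
            raise ValueError(f"unexpected token: {token!r}")
        flags[letter] = (sign == "+")
    return flags


def describe(flags):
    """Summarize the mode (H) and offered key strengths (S, M, L)."""
    mode = "stateless" if flags.get("H") else "stateful"
    strengths = [name
                 for letter, name in (("S", "strong"),
                                      ("M", "medium"),
                                      ("L", "low"))
                 if flags.get(letter)]
    return mode, strengths


# The server's first proposal and the options finally agreed in the log.
first_offer = parse_mppe_options("+H -M +S +L -D -C")
agreed = parse_mppe_options("-H -M +S -L -D -C")
print(describe(first_offer))
print(describe(agreed))
```

Applied to the final ConfAck in the listing, the summary is stateful mode with only the strong key strength selected, which matches the "MPPE 128-bit stateful compression enabled" line.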

After the successful negotiation of MPPE, the server requests the use of VJ
header compression and provides its IPv4 and IPv6 addresses, 192.168.0.1 and
fe80::0206:5bff:fedd:c5c3. This IPv6 address is derived from the server's
Ethernet MAC address 00:06:5B:DD:C5:C3. The client initially suggests its IPv4
address and name servers to be 0.0.0.0 using IPCP, but this is rejected. The client
then requests to use fe80::0000:0000:dead:beef as its IPv6 address, which
is accepted and acknowledged. Finally, the client ACKs both the IPv4 and IPv6
addresses of the server, and the IPv6 addresses have been established. Next, the
client again requests IPv4 and name server addresses of 0.0.0.0, which is rejected in
favor of 192.168.0.2 for the client and 192.168.0.1 for the name servers. These are
accepted and acknowledged.
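
The relationship between the server's MAC address and its link-local IPv6 address can be checked with a short calculation. The sketch below assumes the common modified EUI-64 construction (flip the universal/local bit of the first byte and insert ff:fe in the middle of the MAC address); the text does not name the method, but this construction reproduces the address seen in the log. The function name link_local_from_mac is hypothetical.

```python
import ipaddress


def link_local_from_mac(mac):
    """Build an fe80::/64 address from a 48-bit MAC (modified EUI-64)."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02                                   # flip the U/L bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    prefix = b"\xfe\x80" + b"\x00" * 6                  # fe80::/64
    return ipaddress.IPv6Address(prefix + interface_id)


addr = link_local_from_mac("00:06:5B:DD:C5:C3")
print(addr)            # fe80::206:5bff:fedd:c5c3
print(addr.exploded)   # fe80:0000:0000:0000:0206:5bff:fedd:c5c3
```

Python prints the address in compressed form, so the group 0206 appears as 206; it is the same address as fe80::0206:5bff:fedd:c5c3 in the listing.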

As we can see from this exchange, the PPP negotiation is both flexible and
tedious. There are many options that can be attempted, rejected, and renegotiated.
While this may not be a big problem on a link with low delay, imagine how long
this exchange could take if each message took a few seconds (or longer) to reach its
destination, as might occur over a satellite link, for example. Link establishment
would be a visibly long procedure for the user.


3.7 Loopback 

Although it may seem surprising, in many cases clients may wish to communicate
with servers on the same computer using Internet protocols such as TCP/IP. To
enable this, most implementations support a network-layer loopback capability that
typically takes the form of a virtual loopback network interface. It acts like a real
network interface but is really a special piece of software provided by the operating
system to enable TCP/IP and other communications on the same host computer.
IPv4 addresses starting with 127 are reserved for this, as is the IPv6 address
::1 (see Chapter 2 for IPv4 and IPv6 addressing conventions). Traditionally,
UNIX-like systems including Linux assign the IPv4 address of 127.0.0.1 (::1 for IPv6)
to the loopback interface and assign it the name localhost. An IP datagram sent to the
loopback interface must not appear on any network. Although we could imagine
the transport layer detecting that the other end is a loopback address and
short-circuiting some of the transport-layer logic and all of the network-layer logic, most
implementations perform complete processing of the data in the transport layer
and network layer and loop the IP datagram back up in the network stack only
when the datagram leaves the bottom of the network layer. This can be useful for
performance measurement, for example, because the amount of time required to
execute the stack software can be measured without any hardware overheads. In
Linux, the loopback interface is called lo.


Linux% ifconfig lo
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:458511 errors:0 dropped:0 overruns:0 frame:0
          TX packets:458511 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:266049199 (253.7 MiB)  TX bytes:266049199 (253.7 MiB)

Here we see that the local loopback interface has the IPv4 address 127.0.0.1
and a subnet mask of 255.0.0.0 (corresponding to class A network number 127
in classful addressing). The IPv6 address ::1 has a 128-bit-long prefix, so it
represents only a single address. The interface has an MTU of 16KB (this can be
configured to a much larger size, up to 2GB). A significant amount of traffic, nearly half a
million packets, has passed through the interface without error since the machine
was initialized two months earlier. We would not expect to see errors on the local
loopback device, given that it never really sends packets on any network.

In Windows, the Microsoft Loopback Adapter is not installed by default, even
though IP loopback is still supported. This adapter can be used for testing various
network configurations even when a physical network interface is not available.
To install it under Windows XP, select Start | Control Panel | Add Hardware |
Select Network Adapters from list | Select Microsoft as manufacturer | Select
Microsoft Loopback Adapter. For Windows Vista or Windows 7, run the program
hdwwiz from the command prompt and add the Microsoft Loopback Adapter
manually. Once this is performed, the ipconfig command reveals the following
(this example is from Windows Vista):

C:\> ipconfig /all

Ethernet adapter Local Area Connection 2:

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Microsoft Loopback Adapter
   Physical Address. . . . . . . . . : 02-00-4C-4F-4F-50
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::9c0d:77a:52b8:39f0%18(Preferred)
   Autoconfiguration IPv4 Address. . : 169.254.57.240(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.0.0
   Default Gateway . . . . . . . . . :
   DHCPv6 IAID . . . . . . . . . . . : 302121036
   DNS Servers . . . . . . . . . . . : fec0:0:0:ffff::1%1
                                       fec0:0:0:ffff::2%1
                                       fec0:0:0:ffff::3%1
   NetBIOS over Tcpip. . . . . . . . : Enabled

Here we can see that the interface has been created, has been assigned both
IPv4 and IPv6 addresses, and appears as a sort of virtual Ethernet device. Now the
machine has several loopback addresses:


C:\> ping 127.1.2.3

Pinging 127.1.2.3 with 32 bytes of data:

Reply from 127.1.2.3: bytes=32 time<1ms TTL=128
Reply from 127.1.2.3: bytes=32 time<1ms TTL=128
Reply from 127.1.2.3: bytes=32 time<1ms TTL=128
Reply from 127.1.2.3: bytes=32 time<1ms TTL=128

Ping statistics for 127.1.2.3:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms

C:\> ping ::1

Pinging ::1 from ::1 with 32 bytes of data:

Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms
Reply from ::1: time<1ms

Ping statistics for ::1:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms

C:\> ping 169.254.57.240

Pinging 169.254.57.240 with 32 bytes of data:

Reply from 169.254.57.240: bytes=32 time<1ms TTL=128
Reply from 169.254.57.240: bytes=32 time<1ms TTL=128
Reply from 169.254.57.240: bytes=32 time<1ms TTL=128
Reply from 169.254.57.240: bytes=32 time<1ms TTL=128

Ping statistics for 169.254.57.240:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms


Here we can see that in IPv4, any destination address starting with 127 is
looped back. For IPv6, however, only the single address ::1 is defined for loopback
operation. We can also see how the loopback adapter with address 169.254.57.240
returned data immediately. One subtlety to which we will return in Chapter 9 is
whether multicast or broadcast datagrams should be copied back to the sending
computer (over the loopback interface). This choice can be made by each
individual application.
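
The address classes involved can be confirmed with Python's standard ipaddress module. This is a small sketch that classifies the three addresses pinged above; the helper name classify is hypothetical. It reports whether an address falls in a loopback range (127/8 for IPv4, the single address ::1 for IPv6) or in a link-local range, which is where the adapter's autoconfigured 169.254.57.240 address belongs.

```python
import ipaddress


def classify(text):
    """Return (is_loopback, is_link_local) for an IPv4 or IPv6 address."""
    addr = ipaddress.ip_address(text)
    return addr.is_loopback, addr.is_link_local


for text in ("127.0.0.1", "127.1.2.3", "::1", "169.254.57.240"):
    loopback, link_local = classify(text)
    print(f"{text:16} loopback={loopback} link-local={link_local}")
```

Note that 169.254.57.240 is not a loopback address in this sense. It is answered locally because it is assigned to an interface of the same machine.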


3.8 MTU and Path MTU 

As we can see from Figure 3-3, there is a limit on the size of the frame available for
carrying the PDUs of higher-layer protocols in many link-layer networks such as
Ethernet. This usually limits the number of payload bytes to about 1500 for
Ethernet and often the same amount for PPP in order to maintain compatibility with
Ethernet. This characteristic of the link layer is called the maximum transmission
unit (MTU). Most packet networks (like Ethernet) have a fixed upper limit. Most
stream-type networks (serial links) have a configurable limit that is then used by
framing protocols such as PPP. If IP has a datagram to send, and the datagram is
larger than the link layer's MTU, IP performs fragmentation, breaking the
datagram up into smaller pieces (fragments), so that each fragment is no larger than the
MTU. We discuss IP fragmentation in Chapters 5 and 10.

When two hosts on the same network are communicating with each other, it is
the MTU of the local link interconnecting them that has a direct effect on the size
of datagrams that are used during the conversation. When two hosts communicate
across multiple networks, each link can have a different MTU. The minimum
MTU across the network path comprising all of the links is called the path MTU.

The path MTU between any two hosts need not be constant over time. It
depends on the path being used at any time, which can change if the routers or
links in the network fail. Also, paths are often not symmetric (i.e., the path from
host A to B may not be the reverse of the path from B to A); hence the path MTU
need not be the same in the two directions.
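
The definition amounts to taking a minimum over the links of a path, computed separately for each direction. The following Python sketch makes this concrete; the function name path_mtu and the MTU values for the two paths are hypothetical numbers chosen only to show an asymmetric case.

```python
def path_mtu(link_mtus):
    """Path MTU: the smallest MTU of any link along the path."""
    if not link_mtus:
        raise ValueError("a path needs at least one link")
    return min(link_mtus)


# Hypothetical per-link MTUs (in bytes) for the two directions of a route.
a_to_b = [1500, 1492, 1500]
b_to_a = [1500, 1400, 1500, 1500]

print(path_mtu(a_to_b))   # 1492
print(path_mtu(b_to_a))   # 1400
```

If a routing change replaced the 1492-byte link with a 1500-byte link, the path MTU from A to B would rise to 1500, which illustrates why the value cannot be assumed constant over time.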

[RFC1191] specifies the path MTU discovery (PMTUD) mechanism for IPv4,
and [RFC1981] describes it for IPv6. A complementary approach that avoids some
of the issues with these mechanisms is described in [RFC4821]. PMTU discovery is
used to determine the path MTU at a point in time and is required of IPv6
implementations. In later chapters we shall see how this mechanism operates after we
have described ICMP and IP fragmentation. We shall also see what effect it can
have on transport performance when we discuss TCP and UDP.


3.9 Tunneling Basics

In some cases it is useful to establish a virtual link between one computer and 
another across the Internet or other network. VPNs, for example, offer this type of 
service. The method most commonly used to implement these types of services is 
called tunneling. Tunneling, generally speaking, is the idea of carrying lower-layer 
traffic in higher-layer (or equal-layer) packets. For example, IPv4 can be carried in 
an IPv4 or IPv6 packet; Ethernet can be carried in a UDP or IPv4 or IPv6 packet, 
and so on. Tunneling turns the idea of strict layering of protocols on its head and 
allows for the formation of overlay networks (i.e., networks where the "links" are 
really virtual links implemented in some other protocol instead of physical
connections). It is a very powerful and useful technique. Here we discuss the basics
of some of the tunneling options. 

There is a great variety of methods for tunneling packets of one protocol
and/or layer over another. Three of the more common protocols used to establish
tunnels include Generic Routing Encapsulation (GRE) [RFC2784], the Microsoft
proprietary Point-to-Point Tunneling Protocol (PPTP) [RFC2637], and the Layer 2
Tunneling Protocol (L2TP) [RFC3931]. Others include the earlier nonstandard IP-in-IP
tunneling protocol [RFC1853]. GRE and L2TP were developed to standardize and
replace IP-in-IP and PPTP, respectively, but all of these approaches are still in use.
We shall focus on GRE and PPTP, with more emphasis on PPTP, as it is more visible
to individual users even though it is not an IETF standard. L2TP is often used with
security at the IP layer (IPsec; see Chapter 18) because L2TP by itself does not
provide security. Because GRE and PPTP are closely related, we now look at the GRE
header in Figure 3-26, in both its original standard and revised standard forms.


[Figure 3-26 diagram: two header layouts, each drawn 32 bits wide with bit
positions 0, 12, 15, and 31 marked. Original header: C bit, Reserved (0) (12 bits),
Version (3 bits), Protocol Type (16 bits), followed by optional Checksum and
Reserved1 fields. Revised header: C, K, and S bits, Reserved (0) (9 bits),
Version (3 bits), Protocol Type (16 bits), followed by optional Checksum,
Reserved1, Key, and Sequence Number fields.]

Figure 3-26 The basic GRE header is only 4 bytes but includes the option of a 16-bit checksum (of a type
common to many Internet protocols). The header was later extended to include an identifier (Key field)
common to multiple packets in a flow, and a Sequence Number, to help in resequencing packets that
get out of order.

As can be seen from the headers in Figure 3-26, the baseline GRE specification
[RFC2784] is rather simple and provides only a minimal encapsulation for other 
packets. The first bit field (C) indicates whether a checksum is present. If it is, 
the Checksum field contains the same type of checksum found in many Internet- 
related protocols (see Section 5.2.2). If the Checksum field is present, the Reserved1
field is also present and is set to 0. [RFC2890] extends the basic format to include
optional Key and Sequence Number fields, present if the K and S bit fields from 
Figure 3-26 are set to 1, respectively. If present, the Key field is arranged to be a 
common value in multiple packets, indicating that they belong to the same flow 
of packets. The Sequence Number field is used in order to reorder packets if they 
should become out of sequence (e.g., by going through different links). 
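
The header layout can be made concrete with a short Python sketch that builds and parses GRE headers using the standard struct module. It assumes the bit assignments of [RFC2784] and [RFC2890], with the C, K, and S flags taken as 0x8000, 0x2000, and 0x1000 in the first 16-bit word and the version in the low 3 bits. The function names build_gre and parse_gre are hypothetical, and checksum generation is deliberately left out (the builder always sets C to 0).

```python
import struct

GRE_C = 0x8000   # Checksum (and Reserved1) present
GRE_K = 0x2000   # Key present
GRE_S = 0x1000   # Sequence Number present


def build_gre(protocol_type, key=None, seq=None):
    """Build a GRE header without a checksum; key and seq are optional."""
    flags = 0
    optional = b""
    if key is not None:
        flags |= GRE_K
        optional += struct.pack("!I", key)
    if seq is not None:
        flags |= GRE_S
        optional += struct.pack("!I", seq)
    return struct.pack("!HH", flags, protocol_type) + optional


def parse_gre(data):
    """Parse a GRE header and report which optional fields were present."""
    flags, protocol_type = struct.unpack_from("!HH", data, 0)
    offset = 4
    fields = {"version": flags & 0x7, "protocol_type": protocol_type,
              "checksum": None, "key": None, "seq": None}
    if flags & GRE_C:
        fields["checksum"], _reserved1 = struct.unpack_from("!HH", data, offset)
        offset += 4
    if flags & GRE_K:
        (fields["key"],) = struct.unpack_from("!I", data, offset)
        offset += 4
    if flags & GRE_S:
        (fields["seq"],) = struct.unpack_from("!I", data, offset)
        offset += 4
    fields["header_len"] = offset
    return fields


basic = build_gre(0x0800)                      # 0x0800: IPv4 payload
extended = build_gre(0x0800, key=7, seq=1)
print(len(basic), len(extended))               # 4 12
```

The basic header occupies 4 bytes, and each optional field that is present adds 4 more, so a header with both Key and Sequence Number is 12 bytes long.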

Although GRE forms the basis for and is used by PPTP, the two protocols serve 
somewhat different purposes. GRE tunnels are typically used within the network 
infrastructure to carry traffic between ISPs or within an enterprise intranet to 
serve branch offices and are not necessarily encrypted, although GRE tunnels can 
be combined with IPsec. PPTP, conversely, is most often used between users and 
their ISPs or corporate intranets and is encrypted (e.g., using MPPE). PPTP
essentially combines GRE with PPP, so GRE can provide the virtual point-to-point link
upon which PPP operates. GRE carries its traffic using IPv4 or IPv6 and as such
is a layer 3 tunneling technology. PPTP is more often used to carry layer 2 frames
(such as Ethernet) so as to emulate a direct LAN (link-layer) connection. This can
be used for remote access to corporate networks, for example. PPTP uses a
nonstandard variation on the standard GRE header (see Figure 3-27).


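[Figure 3-27 diagram: header layout drawn 32 bits wide with bit positions 0, 12,
15, and 31 marked. First word: C, R, K, S, s, Recur, A, and Flags fields, then
Protocol Type (16 bits). Second word: Key (HW) Payload Length and Key (LW)
Call ID. These are followed by optional Sequence Number and Acknowledgment
Number fields.]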

Figure 3-27 The PPTP header is based on an older, nonstandard GRE header. It includes a sequence number, 
a cumulative packet acknowledgment number, and some identification information. Most of the 
fields in the first word are set to 0. 


We can see a number of differences in Figure 3-27 from the standard GRE 
header, including the extra R, s, and A bit fields, additional Flags field, and Recur 
field. Most of these are simply set to 0 and not used (their assignment is based on 
an older, nonstandard version of GRE). The K, S, and A bit fields indicate that the 
Key, Sequence Number, and Acknowledgment Number fields are present. If present, 
the value of the Acknowledgment Number field holds the largest packet number seen by the
peer.

We now turn to the establishment of a PPTP session. We shall conclude later
with a brief discussion of some of PPTP's other capabilities. The following example
is similar to the PPP link establishment example given earlier, except now instead
of using a dial-up link, PPTP is providing the "raw" link to PPP. Once again, the
client is Windows Vista, and the server is Linux. This output comes from the
/var/log/messages file when the debug option is enabled:

pptpd: MGR: Manager process started
pptpd: MGR: Maximum of 100 connections available
pptpd: MGR: Launching /usr/sbin/pptpctrl to handle client
pptpd: CTRL: local address = 192.168.0.1
pptpd: CTRL: remote address = 192.168.1.1
pptpd: CTRL: pppd options file = /etc/ppp/options.pptpd
pptpd: CTRL: Client 71.141.227.30 control connection started
pptpd: CTRL: Received PPTP Control Message (type: 1)
pptpd: CTRL: Made a START CTRL CONN RPLY packet
pptpd: CTRL: I wrote 156 bytes to the client.
pptpd: CTRL: Sent packet to client
pptpd: CTRL: Received PPTP Control Message (type: 7)
pptpd: CTRL: Set parameters to 100000000 maxbps, 64 window size
pptpd: CTRL: Made a OUT CALL RPLY packet
pptpd: CTRL: Starting call (launching pppd, opening GRE)
pptpd: CTRL: pty_fd = 6
pptpd: CTRL: tty_fd = 7
pptpd: CTRL (PPPD Launcher): program binary = /usr/sbin/pppd
pptpd: CTRL (PPPD Launcher): local address = 192.168.0.1
pptpd: CTRL (PPPD Launcher): remote address = 192.168.1.1
pppd: pppd 2.4.4 started by root, uid 0
pppd: using channel 60
pptpd: CTRL: I wrote 32 bytes to the client.
pptpd: CTRL: Sent packet to client
pppd: Using interface ppp0
pppd: Connect: ppp0 <--> /dev/pts/1
pppd: sent [LCP ConfReq id=0x1 <asyncmap 0x0> <auth chap MS-v2>
      <magic 0x4e2ca200> <pcomp> <accomp>]
pptpd: CTRL: Received PPTP Control Message (type: 15)
pptpd: CTRL: Got a SET LINK INFO packet with standard ACCMs
pptpd: GRE: accepting packet #0
pppd: rcvd [LCP ConfReq id=0x0 <mru 1400> <magic 0x5e565505>
      <pcomp> <accomp>]
pppd: sent [LCP ConfAck id=0x0 <mru 1400> <magic 0x5e565505>
      <pcomp> <accomp>]
pppd: sent [LCP ConfReq id=0x1 <asyncmap 0x0> <auth chap MS-v2>
      <magic 0x4e2ca200> <pcomp> <accomp>]
pptpd: GRE: accepting packet #1
pppd: rcvd [LCP ConfAck id=0x1 <asyncmap 0x0> <auth chap MS-v2>
      <magic 0x4e2ca200> <pcomp> <accomp>]
pppd: sent [CHAP Challenge id=0x3
      <eb88bfff67d1c239ef73e98ca32646a5>, name = "dialer"]
pptpd: CTRL: Received PPTP Control Message (type: 15)
pptpd: CTRL: Ignored a SET LINK INFO packet with real ACCMs!
pptpd: GRE: accepting packet #2
pppd: rcvd [CHAP Response id=0x3 <276f3678f0f03fa57f64b3c367529565000000
      00000000000fa2b2ae0ad8db9d986f8e222a0217a620638a24
      3179160900>, name = "dialer"]
pppd: sent [CHAP Success id=0x3
      "S=C551119E0E1AAB68E86DED09A32D0346D7002E05
      M=Access granted"]
pppd: sent [CCP ConfReq id=0x1 <mppe +H -M +S +L -D -C>]
pptpd: GRE: accepting packet #3
pppd: rcvd [IPV6CP ConfReq id=0x1 <addr fe80::1cfc:fddd:8e2c:e118>]
pppd: sent [IPV6CP TermAck id=0x1]
pptpd: GRE: accepting packet #4
pppd: rcvd [CCP ConfReq id=0x2 <mppe +H -M -S -L -D -C>]
pppd: sent [CCP ConfNak id=0x2 <mppe +H -M +S +L -D -C>]
pptpd: GRE: accepting packet #5
pptpd: GRE: accepting packet #6
pppd: rcvd [IPCP ConfReq id=0x3 <addr 0.0.0.0> <ms-dns1 0.0.0.0>
      <ms-wins 0.0.0.0> <ms-dns3 0.0.0.0> <ms-wins 0.0.0.0>]
pptpd: GRE: accepting packet #7
pppd: sent [IPCP TermAck id=0x3]
pppd: rcvd [CCP ConfNak id=0x1 <mppe +H -M +S -L -D -C>]
pppd: sent [CCP ConfReq id=0x2 <mppe +H -M +S -L -D -C>]
pppd: rcvd [CCP ConfReq id=0x4 <mppe +H -M +S -L -D -C>]
pppd: sent [CCP ConfAck id=0x4 <mppe +H -M +S -L -D -C>]
pptpd: GRE: accepting packet #8
pppd: rcvd [CCP ConfAck id=0x2 <mppe +H -M +S -L -D -C>]
pppd: MPPE 128-bit stateless compression enabled
pppd: sent [IPCP ConfReq id=0x1 <addr 192.168.0.1>]
pppd: sent [IPV6CP ConfReq id=0x1 <addr fe80::0206:5bff:fedd:c5c3>]
pptpd: GRE: accepting packet #9
pppd: rcvd [IPCP ConfAck id=0x1 <addr 192.168.0.1>]
pptpd: GRE: accepting packet #10
pppd: rcvd [IPV6CP ConfAck id=0x1 <addr fe80::0206:5bff:fedd:c5c3>]
pptpd: GRE: accepting packet #11
pppd: rcvd [IPCP ConfReq id=0x5 <addr 0.0.0.0>
      <ms-dns1 0.0.0.0> <ms-wins 0.0.0.0>
      <ms-dns3 0.0.0.0> <ms-wins 0.0.0.0>]
pppd: sent [IPCP ConfRej id=0x5 <ms-wins 0.0.0.0> <ms-wins 0.0.0.0>]
pptpd: GRE: accepting packet #12
pppd: rcvd [IPV6CP ConfReq id=0x6 <addr fe80::1cfc:fddd:8e2c:e118>]
pppd: sent [IPV6CP ConfAck id=0x6 <addr fe80::1cfc:fddd:8e2c:e118>]
pppd: local LL address fe80::0206:5bff:fedd:c5c3
pppd: remote LL address fe80::1cfc:fddd:8e2c:e118
pptpd: GRE: accepting packet #13
pppd: rcvd [IPCP ConfReq id=0x7 <addr 0.0.0.0>
      <ms-dns1 0.0.0.0> <ms-dns3 0.0.0.0>]
pppd: sent [IPCP ConfNak id=0x7 <addr 192.168.1.1>
      <ms-dns1 192.168.0.1> <ms-dns3 192.168.0.1>]
pptpd: GRE: accepting packet #14
pppd: rcvd [IPCP ConfReq id=0x8 <addr 192.168.1.1>
      <ms-dns1 192.168.0.1> <ms-dns3 192.168.0.1>]
pppd: sent [IPCP ConfAck id=0x8 <addr 192.168.1.1>
      <ms-dns1 192.168.0.1> <ms-dns3 192.168.0.1>]
pppd: local IP address 192.168.0.1
pppd: remote IP address 192.168.1.1
pptpd: GRE: accepting packet #15
pptpd: CTRL: Sending ECHO REQ id 1
pptpd: CTRL: Made a ECHO REQ packet
pptpd: CTRL: I wrote 16 bytes to the client.
pptpd: CTRL: Sent packet to client


This output looks similar to the PPP example we examined earlier, except this 
one has output from both the pppd process as well as a pptpd process. These 
processes work together to establish PPTP sessions at the server. The setup begins 
with pptpd receiving a type 1 control message, indicating that the client wishes 
to establish a control connection. PPTP uses a separate control and data stream, 
so first the control stream is set up. After responding to this request, the server 
receives a type 7 control message indicating an outgoing call request from the peer.
The maximum speed (in bits per second) is set to a large value of 100,000,000, which
effectively means it is unbounded. The window is set to 64, a concept we typically
encounter in transport protocols such as TCP (see Chapter 15). Here the window
is used for flow control. That is, PPTP uses its sequence numbers and acknowledgment
numbers to determine how many frames reach the destination successfully. If
too few frames are successfully delivered, the sender slows down. To determine the
amount of time to wait for an acknowledgment for frames it sends, PPTP uses an
adaptive timeout mechanism based on estimating the round-trip time of the link.
We shall see this type of calculation again when we study TCP.
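
To give a feel for what an adaptive timeout based on round-trip time estimation looks like, here is a generic Python sketch. It is not the algorithm PPTP specifies: the class name RttEstimator, the smoothing gains alpha and beta, and the multiplier k are all assumptions chosen for illustration, in the style of the smoothed estimators commonly used by transport protocols. Each new round-trip sample nudges a running average and a running measure of variation, and the timeout is set some distance above the average.

```python
class RttEstimator:
    """Generic smoothed RTT estimator (illustrative; not the PPTP algorithm)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25, k=4.0):
        self.alpha = alpha             # gain for the smoothed RTT (assumed)
        self.beta = beta               # gain for the deviation (assumed)
        self.k = k                     # safety multiplier (assumed)
        self.srtt = first_sample       # smoothed round-trip time, seconds
        self.dev = first_sample / 2    # smoothed deviation, seconds

    def update(self, sample):
        """Fold in one measured round-trip time and return the new timeout."""
        error = sample - self.srtt
        self.srtt += self.alpha * error
        self.dev += self.beta * (abs(error) - self.dev)
        return self.timeout()

    def timeout(self):
        """How long to wait for an acknowledgment before giving up."""
        return self.srtt + self.k * self.dev


est = RttEstimator(first_sample=0.100)
for sample in (0.100, 0.500, 0.120):
    print(f"sample={sample:.3f}s timeout={est.update(sample):.4f}s")
```

A steady link keeps the timeout close to the measured round-trip time, while a sudden slow sample raises both the average and the deviation so that the sender waits longer before concluding that a frame was lost.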

Soon after the window is set, the pppd application begins to run and process
the PPP data as we saw before in the dial-up example. The only real difference
between the two is that pptpd relays packets to the pppd process as they arrive
and depart, and a few special PPTP messages (such as set link info and echo
request) are processed by pptpd itself. This example illustrates how the PPTP
protocol really acts as a GRE tunneling agent for PPP packets. This is convenient
because an existing PPP implementation (here, pppd) can be used as is to process
the encapsulated PPP packets. Note that while GRE is itself ordinarily encapsulated
in IPv4 packets, similar functionality is available using IPv6 to tunnel packets
[RFC2473].


3.9.1 Unidirectional Links 

An interesting issue arises when the link to be used operates in only one direction.
Such links are called unidirectional links (UDLs), and many of the protocols
described so far do not operate properly in such circumstances because they
require exchanges of information (e.g., PPP's configuration messages). To deal
with this situation, a standard has been created whereby tunneling over a second
Internet interface can be combined with operation of the UDL [RFC3077]. The
typical situation where this arises is an Internet connection that uses a satellite for
downstream traffic (headed to the user) and a dial-up modem link for upstream
traffic. This setup can be useful in cases where the satellite-connected user's usage
is dominated by downloading as opposed to uploading and was commonly used
in early satellite Internet installations. It operates by encapsulating link-layer
upstream traffic in IP packets using a GRE encapsulation.

To establish and maintain tunnels automatically at the receiver, [RFC3077]
specifies a Dynamic Tunnel Configuration Protocol (DTCP). DTCP involves sending
multicast Hello messages on the downlink so that any interested receiver can
learn about the existence of the UDL and its MAC and IP addresses. In addition,
Hello messages indicate a list of tunnel endpoints within the network that can be
reached by the user's secondary interface. After the user selects which tunnel
endpoint to use, DTCP arranges for return traffic to be encapsulated with the same
MAC type as the UDL in GRE tunnels. The service provider arranges to receive
these GRE-encapsulated layer 2 frames (frequently Ethernet), extract them from
the tunnel, and forward them appropriately. Thus, although the upstream side of
the UDLs (provider's side) requires manual tunnel configuration, the downstream
side, which includes many more users, has automatically configured tunnels.
Note that this approach to handling UDLs essentially "hides" the link asymmetry
from the upper-layer protocols. As a consequence, the performance (latency,
bandwidth) of the "two" directions of the link may be highly asymmetric and may
adversely affect higher-layer protocols [RFC3449].

As the satellite example helps to illustrate, one significant issue with tunnels
is the amount of effort required to configure them, which has traditionally been
done by hand. Typically, tunnel configuration involves selecting the endpoints of
a tunnel and configuring the devices located at the tunnel endpoints with an IP
address of the peer, and perhaps also providing protocol selection and authentication
information. A number of techniques have arisen to help in configuring or
using tunnels automatically. One such approach specified for transitioning from
IPv4 to IPv6 is called 6to4 [RFC3056]. In 6to4, IPv6 packets are tunneled over an
IPv4 network using the encapsulation specified in [RFC3056]. A problem with this
approach occurs when corresponding hosts are located behind network address
translators (see Chapter 7). This is common today, especially for home users.
Dealing with the IPv6 transition using automatically configured tunnels is specified in
an approach called Teredo [RFC4380]. Teredo tunnels IPv6 packets over UDP/IPv4
packets. Because this approach requires some background in IPv4 and IPv6, as
well as UDP, we postpone any detailed discussion of such tunnel autoconfiguration
options to Chapter 10.


3.10 Attacks on the Link Layer 

Attacking layers below TCP/IP in order to affect the operations of TCP/IP networks
has been a popular approach because much of the link-layer information is
not shared by the higher layers and can therefore be somewhat difficult to detect
and mitigate. Nevertheless, many such attacks are now well understood, and we
mention a few of them here to better understand how problems at the link layer
can affect higher-layer operations.

In conventional wired Ethernet, interfaces can be placed in promiscuous mode,
which allows them to receive traffic even if it is not destined for them. In the early
days of Ethernet, when the medium was literally a shared cable, this capability
allowed anyone with a computer attached to the Ethernet cable to "sniff" anybody
else's frames and inspect their contents. As many higher-layer protocols at the time
included sensitive information such as passwords, it was nearly trivial to intercept
a person's password by merely looking at the ASCII decode of a packet trace. Two
factors have affected this approach substantially: the deployment of switches and
the deployment of encryption in higher-layer protocols. With switches, the only
traffic that is provided on a switch port to which an end station is attached is traffic
destined for the station itself (or others for which it is bridging) and broadcast/
multicast traffic. As this type of traffic rarely contains information such as passwords,
the attack is largely thwarted. Much more effective, however, is simply the
use of encryption at higher layers, which is now common. In this case, sniffing
packets leads to little benefit as the contents are essentially impossible to read.

Another type of attack targets the operation of switches. Recall that switches
hold tables of stations on a per-port basis. If these tables are able to be filled quickly
(e.g., by quickly masquerading as a large number of stations), it is conceivable that
the switch might be forced into discarding legitimate entries, leading to service
interruption for legitimate stations. A related but probably worse attack can be
mounted using the STP. In this case, an attacking station can masquerade as a
switch with a low-cost path to the root bridge and cause traffic to be directed
toward it.

With Wi-Fi networks, some of the eavesdropping and masquerading issues
present in wired Ethernet networks are exacerbated, as any station can enter a
monitoring mode and sniff packets from the air (although placing an 802.11 interface
into monitoring mode tends to be more challenging than placing an Ethernet
interface into promiscuous mode, as doing so depends on an appropriate device
driver). Some of the earliest "attacks" (which may not really have been attacks,
depending on the relevant legal framework) involved simply roaming about while
scanning, looking for access points providing Internet connectivity (i.e., war driving).
Although many access points use encryption to limit access to authorized
users, others are either open or use so-called capturing portals that direct a would-be
user to a registration Web page and then filter access based on MAC address.
Capturing portal systems have been subverted by observing a station as it registers
and "hijacking" the connection as it is formed by impersonating the legitimate
registering user.

A more sophisticated set of attacks on Wi-Fi involves attacking the cryptographic
protection, especially the WEP encryption used on many early access
points. Attacks on WEP [BHL06] were sufficiently devastating so as to prod the
IEEE into revising the standard. The more recent WPA2 encryption framework
(and WPA, to a lesser extent) is known to be significantly stronger, and WEP is no
longer recommended for use.

PPP links can be attacked in a number of ways if the attacker can gain access
to the channel between the two peers. For very simple authentication mechanisms
(e.g., PAP), sniffing can be used to capture the password in order to facilitate
illegitimate subsequent use. Depending on the type of higher-layer traffic being carried
over the PPP link (e.g., routing traffic), additional unwanted behaviors can be
induced.

In terms of attacks, tunneling can play the role of both target and tool. In
terms of a target, tunnels pass through a network (often the Internet) and thus are
subject to being intercepted and analyzed. The configured tunnel endpoints can
also be attacked, either by attempting to establish more tunnels than the endpoint
can support (a DoS attack) or by attacking the configuration itself. If the configuration
is compromised, it may be possible to open an unauthorized tunnel to an endpoint.
At this point the tunnel becomes a tool rather than a target, and protocols
such as L2TP can provide a convenient protocol-independent method of gaining
access to private internal networks at the link layer. In one GRE-related attack, for
example, traffic is simply inserted in a nonencrypted tunnel, where it appears at
the tunnel endpoint and is injected to the attached "private" network as though it
were sent locally.


3.11 Summary 

In this chapter we examined the lowest layer in the Internet protocol suite with
which we are concerned: the link layer. We looked at the evolution of Ethernet,
in terms of both its increases in speed from 10Mb/s to 10Gb/s and beyond, as well
as its evolution of capabilities, including VLANs, priorities, link aggregation, and
frame formats. We saw how switches provide improved performance over bridges
by implementing a direct electrical path between multiple independent sets of stations,
and how full-duplex operation has largely replaced the earlier half-duplex
operation. We also looked at the IEEE 802.11 wireless LAN "Wi-Fi" standard in
some detail, noting its similarities and differences with respect to Ethernet. It has
become one of the most popular IEEE standards and provides license-free network
access across the two primary bands of 2.4GHz and 5GHz. We also looked
at the evolution of the security methods for Wi-Fi, with the evolution from the
relatively weak WEP to the more formidable WPA and WPA2 frameworks. Moving
beyond IEEE standards, we discussed point-to-point links and the PPP protocol.
PPP can encapsulate essentially any kind of packets used for TCP/IP and
non-TCP/IP networks using an HDLC-like frame format, and it is used on links
ranging from low-speed dial-up modems to high-speed fiber-optic lines. It is a
whole suite of protocols itself, including methods for compression, encryption,
authentication, and link aggregation. Because it supports only two parties, it does
not have to deal with controlling access to a shared medium like the MAC protocols
of Ethernet or Wi-Fi.

The loopback interface is provided by most implementations. Access to this
interface is either through the special loopback address, normally 127.0.0.1 (::1 for
IPv6), or by sending IP datagrams to one of a host's own IP addresses. Loopback
data has been completely processed by the transport layer and by IP when it loops
around to go up the protocol stack. We described an important feature of many
link layers, the MTU, and the related concept of a path MTU.

We also discussed the use of tunneling, which involves carrying lower-layer 
protocols in higher-layer (or equal-layer) packets. This technique allows for the 
formation of overlay networks, using tunnels over the Internet as links in another 
level of network infrastructure. This technique has become very popular, both for 
experimentation with new capabilities (e.g., running an IPv6 network overlay on 
an IPv4 internet) and for operational use (e.g., with VPNs). 

We concluded the chapter with a brief discussion of the types of attacks 
involving the link layer—as either target or tool. Many attacks simply involve 
intercepting traffic for analysis (e.g., looking for passwords), but more sophisticated
attacks involve masquerading as endpoints or modifying traffic in transit.
Other attacks involve compromising control information such as tunnel endpoints 
or the STP to direct traffic to otherwise unintended locations. Access to the link 
layer also provides an attacker with a general way to perform DoS attacks. Perhaps 
the best-known variant of this is jamming communication signals, an endeavor 
undertaken by certain parties since nearly the advent of radio. 

This chapter has covered only some of the common link technologies used 
with TCP/IP today. One reason for the success of TCP/IP is its ability to work on 
top of almost any link technology. In essence, IP requires only that there exists 
some path between sender and receiver(s) across a cascade of intermediate links. 
Although this is a relatively modest requirement, some research is aimed at 
stretching this even farther—to cases where there may never be an end-to-end 
path between sender and receiver(s) at any single point in time [RFC4838].


ARP: Address Resolution Protocol


4.1 Introduction 

We have seen that the IP protocol is designed to provide interoperability of packet 
switching across a large variety of physical network types. Doing so requires,
among other things, converting between the addresses used by the network-layer
software and those interpreted by the underlying network hardware. Generally,
network interface hardware has one primary hardware address (e.g., a 48-bit value
for an Ethernet or 802.11 wireless interface). Frames exchanged by the hardware
must be addressed to the correct interface using the correct hardware addresses;
otherwise, no data can be transferred. But a conventional IPv4 network works
with its own addresses: 32-bit IPv4 addresses. Knowing a host's IP address is
insufficient for the system to send a frame to that host efficiently on networks
where hardware addresses are used. The operating system software (i.e., the Ethernet
driver) must know the destination's hardware address to send data directly.
For TCP/IP networks, the Address Resolution Protocol (ARP) [RFC0826] provides a
dynamic mapping between IPv4 addresses and the hardware addresses used by
various network technologies. ARP is used with IPv4 only; IPv6 uses the Neighbor
Discovery Protocol, which is incorporated into ICMPv6 (see Chapter 8).

It is important to note here that the network-layer and link-layer addresses
are assigned by different authorities. For network hardware, the primary address
is defined by the manufacturer of the device and is stored in permanent memory
within the device, so it does not change. Thus, any protocol suite designed to
operate with that particular hardware technology must make use of its particular
types of addresses. This allows network-layer protocols of different protocol suites
to operate at the same time. On the other hand, the IP address assigned to a network
interface is installed by the user or network administrator and selected by that
person to meet his or her needs. The IP addresses assigned to a portable device
may, for example, be changed when it is moved. IP addresses are typically derived
from a pool of addresses maintained near the network attachment point and are
installed when systems are turned on or configured (see Chapter 6). When an Ethernet
frame containing an IP datagram is sent from one host on a LAN to another,
it is the 48-bit Ethernet address that determines to which interface(s) the frame is
destined.

Address resolution is the process of discovering the mapping from one address
to another. For the TCP/IP protocol suite using IPv4, this is accomplished by running
the ARP. ARP is a generic protocol, in the sense that it is designed to support
mapping between a wide variety of address types. In practice, however, it
is almost always used to map between 32-bit IPv4 addresses and Ethernet-style
48-bit MAC addresses. This case, the one specified in [RFC0826], is also the one of
interest to us. For this chapter, we shall use the terms Ethernet address and MAC
address interchangeably.

ARP provides a dynamic mapping from a network-layer address to a corresponding
hardware address. We use the term dynamic because it happens automatically
and adapts to changes over time without requiring reconfiguration
by a system administrator. That is, if a host were to have its network interface 
card changed, thereby changing its hardware address (but retaining its assigned 
IP address), ARP would continue to operate properly after some delay. ARP 
operation is normally not a concern of either the application user or the system 
administrator. 


Note 

A related protocol that provides the reverse mapping from ARP, called RARP, was 
used by systems lacking a disk drive (normally diskless workstations or X terminals).
It is rarely used today and requires manual configuration by the system
administrator. See [RFC0903] for details. 


4.2 An Example 

Whenever we use Internet services, such as opening a Web page with a browser, 
our local computer must determine how to contact the server in which we are 
interested. The most basic decision it makes is whether that service is local (part 
of the same IP subnetwork) or remote. If it is remote, a router is required to reach 
the destination. ARP operates only when reaching those systems on the same IP 
subnet. For this example, then, let us assume that we use a Web browser to contact
the following URL: 


http://10.0.0.1 

Note that this URL contains an IPv4 address rather than the more common
domain or host name. The reason for using the address here is to underscore the
fact that our demonstration of ARP is most relevant to systems sharing the same
IPv4 prefix (see Chapter 2). Here, we use a URL containing an address identifying
a local Web server and explore how direct delivery operates. Such local servers are
becoming more common as embedded devices such as printers and VoIP adapters
include built-in Web servers for configuration.

4.2.1 Direct Delivery and ARP 

In this section, we enumerate the steps taken in direct delivery, focusing on the 
operation of ARP. Direct delivery takes place when an IP datagram is sent to an 
IP address with the same IP prefix as the sender's. It plays an important role in 
the general method of forwarding of IP datagrams (see Chapter 5). The following 
list captures the basic operation of direct delivery with IPv4, using the previous 
example: 

1. The application, in this case a Web browser, calls a special function to parse 
the URL to see if it contains a host name. Here it does not, so the application 
uses the 32-bit IPv4 address 10.0.0.1. 

2. The application asks the TCP protocol to establish a connection with 10.0.0.1. 

3. TCP attempts to send a connection request segment to the remote host by 
sending an IPv4 datagram to 10.0.0.1. (We shall see the details of how this is 
done in Chapter 15.) 

4. Because we are assuming that the address 10.0.0.1 is using the same network
prefix as our sending host, the datagram can be sent directly to that
address without going through a router. 

5. Assuming that Ethernet-compatible addressing is being used on the IPv4
subnet, the sending host must convert the 32-bit IPv4 destination address
into a 48-bit Ethernet-style address. Using the terminology from [RFC0826],
a translation is required from the logical Internet address to its corresponding
physical hardware address. This is the function of ARP. ARP works in
its normal form only for broadcast networks, where the link layer is able to
deliver a single message to all attached network devices. This is an important
requirement imposed by the operation of ARP. On non-broadcast networks
(sometimes called NBMA for non-broadcast multiple access), other,
more complex mapping protocols may be required [RFC2332].

6. ARP sends an Ethernet frame called an ARP request to every host on the
shared link-layer segment. This is called a link-layer broadcast. We show the
broadcast domain in Figure 4-1 with a crosshatched box. The ARP request
contains the IPv4 address of the destination host (10.0.0.1) and seeks an
answer to the following question: "If you are configured with IPv4 address
10.0.0.1 as one of your own, please respond to me with your MAC address."

Figure 4-1 Ethernet hosts in the same broadcast domain. ARP queries are sent using link-layer
broadcast frames that are received by all hosts. The single host with the assigned
address responds directly to the requesting host. Non-IP hosts must actively discard
ARP queries.

7. With ARP, all systems in the same broadcast domain receive ARP requests.
This includes systems that may not be running the IPv4 or IPv6 protocols at
all but does not include systems on different VLANs, if they are supported
(see Chapter 3 for details on VLANs). Provided there exists an attached system
using the IPv4 address specified in the request, it alone responds with
an ARP reply. This reply contains the IPv4 address (for matching with the
request) and the corresponding MAC address. The reply does not ordinarily
use broadcast but is directed only to the sender. The host receiving the
ARP request also learns of the sender's IPv4-to-MAC address mapping at
this time and records it in memory for later use (see Section 4.3).

8. The ARP reply is then received by the original sender of the request, and
the datagram that forced the ARP request/reply to be exchanged can now
be sent.

9. The sender now sends the datagram directly to the destination host by
encapsulating it in an Ethernet frame and using the Ethernet address
learned by the ARP exchange as the destination Ethernet address. Because
the Ethernet address refers only to the correct destination host, no other
hosts or routers receive the datagram. Thus, when only direct delivery is
used, no router is required.
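
The decision in step 4, whether the destination shares the sender's prefix, can be sketched in a few lines of Python. This is an illustration only: the /24 prefix length, the router address 10.0.0.254, and the function names are assumptions made for the example and do not come from the text.

```python
import ipaddress

def is_direct_delivery(src_ip: str, prefix_len: int, dst_ip: str) -> bool:
    """Return True if dst_ip shares the sender's IPv4 prefix."""
    local_net = ipaddress.ip_interface(f"{src_ip}/{prefix_len}").network
    return ipaddress.ip_address(dst_ip) in local_net

def address_to_resolve(src_ip: str, prefix_len: int, dst_ip: str,
                       default_router: str) -> str:
    """Pick the IPv4 address that ARP must resolve before a frame is sent."""
    if is_direct_delivery(src_ip, prefix_len, dst_ip):
        return dst_ip           # same prefix: resolve the destination itself
    return default_router       # different prefix: resolve the router instead

if __name__ == "__main__":
    # Hypothetical host 10.0.0.56/24 with an assumed router at 10.0.0.254.
    print(address_to_resolve("10.0.0.56", 24, "10.0.0.1", "10.0.0.254"))
    print(address_to_resolve("10.0.0.56", 24, "192.0.2.7", "10.0.0.254"))
```

In both cases ARP resolves an address on the local subnet; only the choice of which address differs.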

ARP is used in multi-access link-layer networks running IPv4, where each
host has its own primary hardware address. Point-to-point links such as PPP (see
Chapter 3) do not use ARP. When these links are established (normally by action
of the user or a system boot), the system is told of the addresses in use at each
end of the link. Because hardware addresses are not involved, there is no need for
address resolution or ARP.


4.3 ARP Cache 

Essential to the efficient operation of ARP is the maintenance of an ARP cache
(or table) on each host and router. This cache maintains the recent mappings
from network-layer addresses to hardware addresses for each interface that uses
address resolution. When IPv4 addresses are mapped to hardware addresses, the
normal expiration time of an entry in the cache is 20 minutes from the time the
entry was created, as described in [RFC1122].
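
As a rough illustration of such a cache, the following Python sketch keeps IPv4-to-hardware mappings and expires each one a fixed time after it was created. The class name, its methods, and the injectable clock are hypothetical choices made for the example; real implementations live in the operating system and differ in detail.

```python
import time

COMPLETE_TIMEOUT = 20 * 60   # seconds; the 20-minute figure mentioned above

class ArpCache:
    """A toy ARP cache whose entries expire a fixed time after creation."""

    def __init__(self, timeout=COMPLETE_TIMEOUT, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # IPv4 address -> (MAC address, created_at)

    def add(self, ip, mac):
        """Record a mapping, stamping it with the current time."""
        self._entries[ip] = (mac, self._clock())

    def lookup(self, ip):
        """Return the cached MAC address, or None if absent or expired."""
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, created_at = entry
        if self._clock() - created_at >= self._timeout:
            del self._entries[ip]    # expired: a new ARP exchange is needed
            return None
        return mac

if __name__ == "__main__":
    cache = ArpCache()
    cache.add("10.0.0.1", "00:0d:66:4f:60:00")
    print(cache.lookup("10.0.0.1"))
```

A lookup that returns None corresponds to the case where the sender must broadcast an ARP request before it can transmit the datagram.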

We can examine the ARP cache with the arp command on Linux or in Windows.
The -a option displays all entries in the cache for either system. Running
arp on Linux yields the following type of output:


Linux% arp
Address          HWtype  HWaddress           Flags Mask   Iface
gw.home          ether   00:0D:66:4F:60:00   C            eth1
printer.home     ether   00:0A:95:87:38:6A   C            eth1

Linux% arp -a
printer.home (10.0.0.4) at 00:0A:95:87:38:6A [ether] on eth1
gw.home (10.0.0.1) at 00:0D:66:4F:60:00 [ether] on eth1

Running arp on Windows provides output similar to the following:


c:\> arp -a

Interface: 10.0.0.56 - 0x2
  Internet Address      Physical Address      Type
  10.0.0.1              00-0d-66-4f-60-00     dynamic
  10.0.0.4              00-0a-95-87-38-6a     dynamic

Here we see the IPv4-to-hardware addressing cache. In the first (Linux) case,
each mapping is given by a five-element entry: the host name (corresponding to
an IP address), hardware address type, hardware address, flags, and local network
interface for which this mapping is active. The Flags column contains a
symbol: C, M, or P. C-type entries have been learned dynamically by the ARP protocol.
M-type entries are entered by hand (by arp -s; see Section 4.9), and P-type
entries mean "publish." That is, for any P entry, the host responds to incoming
ARP requests with an ARP response. This option is used for configuring proxy
ARP (see Section 4.7). The second Linux example displays similar information
using the "BSD style." Here, both the host's name and address are given, along
with the address type (here, [ether] indicates an Ethernet type of address) and
on which interface the mappings are active.

The Windows arp program displays the IPv4 address of the interface, and its
interface number in hexadecimal (0x2 here). The Windows version also indicates
whether the address was entered by hand or learned by ARP. In this example, both
entries are dynamic, meaning they were learned by ARP (they would say static
if entered by hand). Note that the 48-bit MAC addresses are displayed as six hexadecimal
numbers separated by colons in Linux and dashes in Windows. Traditionally,
UNIX systems have always used colons, whereas the IEEE standards and
other operating systems tend to use dashes. We discuss additional features and
other options of the arp command in Section 4.9.
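
Because the same 48-bit address can be written with colons or with dashes, tools that compare ARP cache output from different systems usually normalize it first. The following Python sketch shows one way to do that; the function names are hypothetical, and it also accepts digit groups without leading zeros, as some tools print them.

```python
def parse_mac(text: str) -> bytes:
    """Parse a MAC address written with colons or dashes into 6 raw bytes."""
    parts = text.strip().replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 groups in MAC address: {text!r}")
    # int(..., 16) accepts "0" as well as "00", and upper- or lowercase digits;
    # bytes() rejects any group larger than 0xff.
    return bytes(int(part, 16) for part in parts)

def format_mac(mac: bytes, sep: str = ":") -> str:
    """Format 6 raw bytes as lowercase hexadecimal pairs joined by sep."""
    return sep.join(f"{byte:02x}" for byte in mac)

if __name__ == "__main__":
    linux_style = parse_mac("00:0D:66:4F:60:00")
    windows_style = parse_mac("00-0d-66-4f-60-00")
    print(linux_style == windows_style)
    print(format_mac(windows_style, "-"))
```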


4.4 ARP Frame Format 

Figure 4-2 shows the common format of an ARP request and reply packet, when
used on an Ethernet network to resolve an IPv4 address. (As mentioned previously,
ARP is general enough to be used with addresses other than IPv4 addresses,
although this is very rare.) The first 14 bytes constitute the standard Ethernet
header, assuming no 802.1p/q or other tags, and the remaining portion is defined
by the ARP protocol. The first 8 bytes of the ARP frame are generic, and the remaining
portion in this example applies specifically when mapping IPv4 addresses to
48-bit Ethernet-style addresses.

Figure 4-2 ARP frame format as used when mapping IPv4 addresses to 48-bit MAC (Ethernet) addresses

In the Ethernet header of the ARP frame shown in Figure 4-2, the first two
fields contain the destination and source Ethernet addresses. For ARP requests, the
special Ethernet destination address of ff:ff:ff:ff:ff:ff (all 1 bits) means the broadcast
address; all Ethernet interfaces in the same broadcast domain receive these
frames. The 2-byte Ethernet frame Length or Type field is required to be 0x0806 for
ARP (requests or replies).

The first four fields following the Length/Type field specify the types and sizes
of the final four fields. The values are maintained by the IANA [RFC5494]. The
adjectives hardware and protocol are used to describe the fields in the ARP packets.
For example, an ARP request asks for the hardware address (an Ethernet address
in this case) corresponding to a protocol address (an IPv4 address in this case).
These adjectives are rarely used outside the ARP context. Rather, the more common
terminology for the hardware address is MAC, physical, or link-layer address
(or Ethernet address when the network in use is based on the IEEE 802.3/Ethernet
series of specifications). The Hard Type field specifies the type of hardware
address. Its value is 1 for Ethernet. The Prot Type field specifies the type of protocol
address being mapped. Its value is 0x0800 for IPv4 addresses. This is purposely
the same value as the Type field of an Ethernet frame containing an IPv4 datagram.
The next two 1-byte fields, Hard Size and Prot Size, specify the sizes, in bytes, of the
hardware addresses and the protocol addresses. For an ARP request or reply for
an IPv4 address on an Ethernet they are 6 and 4, respectively. The Op field specifies
whether the operation is an ARP request (a value of 1), ARP reply (2), RARP
request (3), or RARP reply (4). This field is required because the Length/Type field
is the same for an ARP request and an ARP reply.

The next four fields that follow are the Sender's Hardware Address (an Ethernet 
MAC address in this example), the Sender's Protocol Address (an IPv4 address), the 
Target Hardware (MAC/Ethernet) Address, and the Target Protocol (IPv4) Address. 
Notice that there is some duplication of information: the sender's hardware 
address is available both in the Ethernet header and in the ARP message. For an 
ARP request, all the fields are filled in except the Target Hardware Address (which is 
set to 0). When a system receives an ARP request directed to it, it fills in its hardware
address, swaps the two sender addresses with the two target addresses, sets
the Op field to 2, and sends the reply. 
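
The field layout described above maps directly onto a fixed binary format. The following Python sketch packs an ARP request for IPv4 over Ethernet using the field values given in the text; the constant and function names are hypothetical, and padding the frame with zero bytes to 60 bytes is an assumption about what the sending interface would do rather than part of ARP itself.

```python
import socket
import struct

ETH_TYPE_ARP = 0x0806        # Ethernet Length/Type value for ARP
HARD_TYPE_ETHERNET = 1       # Hard Type
PROT_TYPE_IPV4 = 0x0800      # Prot Type
OP_REQUEST = 1               # Op value for an ARP request
BROADCAST_MAC = b"\xff" * 6

# Hard Type, Prot Type, Hard Size, Prot Size, Op, then the four address fields.
ARP_FORMAT = "!HHBBH6s4s6s4s"

def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Return a broadcast Ethernet frame carrying an ARP request."""
    eth_header = struct.pack("!6s6sH", BROADCAST_MAC, sender_mac, ETH_TYPE_ARP)
    arp_message = struct.pack(
        ARP_FORMAT,
        HARD_TYPE_ETHERNET, PROT_TYPE_IPV4, 6, 4, OP_REQUEST,
        sender_mac, socket.inet_aton(sender_ip),
        b"\x00" * 6,                      # Target Hardware Address is unknown
        socket.inet_aton(target_ip),
    )
    frame = eth_header + arp_message      # 14 + 28 = 42 bytes
    return frame.ljust(60, b"\x00")       # pad to the minimum size (CRC excluded)

if __name__ == "__main__":
    frame = build_arp_request(bytes.fromhex("0000c06f2d40"), "10.0.0.56", "10.0.0.3")
    print(struct.calcsize(ARP_FORMAT), len(frame))
```

Sending such a frame requires a raw link-layer socket and elevated privileges, which this sketch deliberately leaves out.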


4.5 ARP Examples 

In this section we will use the tcpdump command to see what really happens 
with ARP when we execute normal TCP/IP utilities such as Telnet. Telnet is a 
simple application that can establish a TCP/IP connection between two systems. 

4.5.1 Normal Example 

To see the operation of ARP, we will execute the telnet command, connecting to 
a Web server on host 10.0.0.3 using TCP port 80 (called www).

C:\> arp -a                        Verify that the ARP cache is empty
No ARP Entries Found
C:\> telnet 10.0.0.3 www           Connect to the Web server [port 80]
Connecting to 10.0.0.3...
Escape character is

Type Control + right bracket to get the Telnet client prompt.

Welcome to Microsoft Telnet Client
Escape Character is 'CTRL+]'
Microsoft Telnet> quit

The quit directive exits the program.

While this is happening, we run the tcpdump command on another system 
that can observe the traffic exchanged. We use the -e option, which displays the 
MAC addresses (which in our examples are 48-bit Ethernet addresses). 

The following listing contains the output from tcpdump. We have deleted 
the final four lines of the output that correspond to the termination of the connection
(we cover such details in Chapter 13); they are not relevant to the discussion
here. Note that different versions of tcpdump on different systems may provide 
slightly different output details. 


Linux# tcpdump -e

1 0.0                0:0:c0:6f:2d:40 ff:ff:ff:ff:ff:ff arp 60:
                     arp who-has 10.0.0.3 tell 10.0.0.56

2 0.002174 (0.0022)  0:0:c0:c2:9b:26 0:0:c0:6f:2d:40 arp 60:
                     arp reply 10.0.0.3 is-at 0:0:c0:c2:9b:26

3 0.002831 (0.0007)  0:0:c0:6f:2d:40 0:0:c0:c2:9b:26 ip 60:
                     10.0.0.56.1030 > 10.0.0.3.www: S 596459521:596459521(0)
                     win 4096 <mss 1024> [tos 0x10]

4 0.007834 (0.0050)  0:0:c0:c2:9b:26 0:0:c0:6f:2d:40 ip 60:
                     10.0.0.3.www > 10.0.0.56.1030: S 3562228225:3562228225(0)
                     ack 596459522 win 4096 <mss 1024>

5 0.009615 (0.0018)  0:0:c0:6f:2d:40 0:0:c0:c2:9b:26 ip 60:
                     10.0.0.56.1030 > 10.0.0.3.discard: . ack 1 win 4096 [tos 0x10]

In packet 1 the hardware address of the source is 0:0:c0:6f:2d:40. The destination
hardware address is ff:ff:ff:ff:ff:ff, which is the Ethernet broadcast
address. All Ethernet interfaces in the same broadcast domain (all those on the
same LAN or VLAN, whether or not they are running TCP/IP) receive the frame
and process it, as shown in Figure 4-1. The next output field in packet 1, arp,
means that the Frame Type field is 0x0806, specifying either an ARP request or an
ARP reply. The value 60 printed after the words arp and ip in each of the five
packets is the length of the Ethernet frame. The size of an ARP request or ARP
reply is always 42 bytes (28 bytes for the ARP message, 14 bytes for the Ethernet
header). Each frame has been padded to the Ethernet minimum: 60 bytes of data
plus a 4-byte CRC (see Chapter 3).

The next part of packet 1, arp who-has, identifies the frame as an ARP request
with the IPv4 address of 10.0.0.3 as the target address and the IPv4 address of
10.0.0.56 as the sender's address. tcpdump prints the host names corresponding
to the IP addresses by default, but here they are not displayed (because no reverse
DNS mappings for them are set up; Chapter 11 explains details of DNS). We will
use the -n option later to see the IP addresses in the ARP request, whether or not
DNS mappings are available.

From packet 2 we see that while the ARP request is broadcast, the destination
address of the ARP reply is the (unicast) MAC address 0:0:c0:6f:2d:40. The
ARP reply is thus sent directly to the requesting host; it is not ordinarily broadcast
(see Section 4.8 for some cases where this rule is altered). tcpdump prints
the ARP reply for this frame, along with the IPv4 address and hardware address
of the responder. Line 3 is the first TCP segment requesting that a connection be
established. Its destination hardware address is that of the destination host (10.0.0.3).
We shall cover the details of this segment in Chapter 13.

For each packet, the number printed after the packet number is the relative 
time (in seconds) when the packet was received by tcpdump. Each packet other 
than the first also contains the time difference (in seconds) from the previous time, 
in parentheses. We can see in the output that the time between sending the ARP 
request and receiving the ARP reply is about 2.2ms. The first TCP segment is sent 
0.7ms after this. The overhead involved in using ARP for dynamic address resolution
in this example is less than 3ms. Note that if the ARP entry for host 10.0.0.3
was valid in the ARP cache at 10.0.0.56, the initial ARP exchange would not have 
occurred, and the initial TCP segment could have been sent immediately using the 
destination's Ethernet address. 

A subtle point about the tcpdump output is that we do not see an ARP request
from 10.0.0.3 before it sends its first TCP segment to 10.0.0.56 (line 4). While it
is possible that 10.0.0.3 already has an entry for 10.0.0.56 in its ARP cache, normally
when a system receives an ARP request addressed to it, in addition to sending
the ARP reply, it also saves the requestor's hardware address and IPv4 address
in its own ARP cache. This is an optimization based on the logical assumption that
if the requestor is about to send it a datagram, the receiver of the datagram will
probably send a reply.

4.5.2 ARP Request to a Nonexistent Host 

What happens if the host specified in an ARP request is down or nonexistent? To 
see this, we attempt to access a nonexistent local IPv4 address. The prefix corresponds
to that of the local subnet, but there is no host with the specified address.
We will use the IPv4 address 10.0.0.99 in this example. 

Linux% date ; telnet 10.0.0.99 ; date 

Fri Jan 29 14:46:33 PST 2010 
Trying 10.0.0.99... 

telnet: connect to address 10.0.0.99: No route to host 
Fri Jan 29 14:46:36 PST 2010        (3s after previous date)

Linux% arp -a 

? (10.0.0.99) at <incomplete> on eth0 

Here is the output from tcpdump: 

Linux# tcpdump -n arp 

1 21:12:07.440845 arp who-has 10.0.0.99 tell 10.0.0.56 

2 21:12:08.436842 arp who-has 10.0.0.99 tell 10.0.0.56 

3 21:12:09.436836 arp who-has 10.0.0.99 tell 10.0.0.56 

This time we did not specify the -e option because we already know that
the ARP requests are sent using broadcast addressing. The frequency of the ARP
request is very close to one per second, the maximum suggested by [RFC1122].
Testing on a Windows system (not illustrated) reveals a different behavior. Rather
than three requests spaced 1s apart, the spacing varies based on the application
and the other protocols being used. For ICMP and UDP (see Chapters 8 and 10,
respectively), a spacing of approximately 5s is used, whereas for TCP 10s is used.
For TCP, the 10s interval allows two ARP requests to be sent without responses
before TCP gives up trying to establish a connection.


4.6 ARP Cache Timeout 

A timeout is normally associated with each entry in the ARP cache. (Later we
shall see that the arp command enables the administrator to place an entry into
the cache that will never time out.) Most implementations have a timeout of 20
minutes for a completed entry and 3 minutes for an incomplete entry. (We saw an
incomplete entry in our previous example where we forced an ARP to a nonexistent
host.) These implementations normally restart the 20-minute timeout for an
entry each time the entry is used. [RFC1122], the Host Requirements RFC, says
that this timeout should occur even if the entry is in use, but many implementations
do not do this; they restart the timeout each time the entry is referenced.
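The timeout behavior just described can be sketched as a small cache class. This is only an illustration under stated assumptions: the class name `ArpCache`, its methods, and the injectable `clock` parameter are hypothetical and do not correspond to any real operating system's implementation. The 20-minute and 3-minute values are the typical ones quoted above.

```python
import time

COMPLETE_TIMEOUT = 20 * 60    # seconds; typical timeout for a completed entry
INCOMPLETE_TIMEOUT = 3 * 60   # seconds; typical timeout for an incomplete entry


class ArpCache:
    """Toy ARP cache (hypothetical names; illustration only)."""

    def __init__(self, clock=time.monotonic, restart_on_use=True):
        self._clock = clock                    # injectable so tests can fake time
        self._restart_on_use = restart_on_use  # False models the RFC 1122 advice
        self._entries = {}                     # ip -> (mac or None, expiry time)

    def add(self, ip, mac=None):
        """Store a completed entry (mac given) or an incomplete one (mac=None)."""
        timeout = COMPLETE_TIMEOUT if mac is not None else INCOMPLETE_TIMEOUT
        self._entries[ip] = (mac, self._clock() + timeout)

    def lookup(self, ip):
        """Return the cached MAC, or None if missing, incomplete, or expired."""
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, expiry = entry
        now = self._clock()
        if now >= expiry:
            del self._entries[ip]              # entry has timed out
            return None
        if mac is not None and self._restart_on_use:
            # Common behavior: restart the timeout whenever the entry is used.
            self._entries[ip] = (mac, now + COMPLETE_TIMEOUT)
        return mac
```

Constructing the cache with `restart_on_use=False` models the stricter behavior that [RFC1122] asks for, in which an entry expires even while it is in active use.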

Note that this is one of our first examples of soft state. Soft state is information
that is discarded if not refreshed before some timeout is reached. Many Internet
protocols use soft state because it helps to initiate automatic reconfiguration if network
conditions change. The cost of soft state is that some protocol must refresh the
state to avoid expiration. "Soft state refreshes" are often incorporated in a protocol
design to keep the soft state active.


4.7 Proxy ARP 

Proxy ARP [RFC1027] lets a system (generally a specially configured router)
answer ARP requests for a different host. This fools the sender of the ARP request
into thinking that the responding system is the destination host, when in fact the
destination host may be elsewhere (or may not exist). Proxy ARP is not commonly 
used and is generally to be avoided if possible. 

Proxy ARP has also been called promiscuous ARP or the ARP hack. These 
names are from a historical use of proxy ARP: to hide two physical networks from 
each other. In this case both physical networks can use the same IP prefix as long 
as a router in the middle is configured as a proxy ARP agent to respond to ARP 
requests on one network for a host on the other network. This technique can be 
used to "hide" one group of hosts from another. In the past, there were two common
reasons for doing this: some systems were unable to handle subnetting, and
some used an older broadcast address (a host ID of all 0 bits, instead of the current 
standard of a host ID with all 1 bits). 

Linux supports a feature called auto-proxy ARP. It can be enabled by writing 
the character 1 into the file /proc/sys/net/ipv4/conf/*/proxy_arp, or by 
using the sysctl command. This supports the ability of using proxy ARP without
having to manually enter ARP entries for every possible IPv4 address that is
being proxied. Doing so allows a range of addresses, instead of each individual 
address, to be automatically proxied. 


4.8 Gratuitous ARP and Address Conflict Detection (ACD) 

Another feature of ARP is called gratuitous ARP. It occurs when a host sends an 
ARP request looking for its own address. This is usually done when the interface 
is configured "up" at bootstrap time. Here is an example trace taken on a Linux 
machine showing our Windows host booting up: 


Linux# tcpdump -e -n arp 

1 0.0 0:0:c0:6f:2d:40 ff:ff:ff:ff:ff:ff arp 60: 

arp who-has 10.0.0.56 tell 10.0.0.56 

(We specified the -n flag for tcpdump to always print numeric dotted-decimal
addresses instead of host names.) In terms of the fields in the ARP request, the
Sender's Protocol Address and the Target Protocol Address are identical: 10.0.0.56. 
Also, the Source Address field in the Ethernet header, 0:0:c0:6f:2d:40 as shown 
by tcpdump, equals the sender's hardware address. Gratuitous ARP achieves two 
goals: 

1. It lets a host determine if another host is already configured with the same 
IPv4 address. The host sending the gratuitous ARP is not expecting a reply 
to its request. If a reply is received, however, the error message "Duplicate 
IP address sent from Ethernet address . . ." is usually displayed. This is a 
warning to the system administrator and user that one of the systems in the 
same broadcast domain (e.g., LAN or VLAN) is misconfigured. 





2. If the host sending the gratuitous ARP has just changed its hardware 
address (perhaps the host was shut down, the interface card was replaced, 
and then the host was rebooted), this frame causes any other host receiving 
the broadcast that has an entry in its cache for the old hardware address 
to update its ARP cache entry accordingly. As mentioned before, if a host
receives an ARP request from an IPv4 address that is already in the receiver's
cache, that cache entry is updated with the sender's hardware address
from the ARP request. This is done for any ARP request received by the 
host; gratuitous ARP happens to take advantage of this behavior. 
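The cache-update rule that gratuitous ARP relies on can be sketched as follows. This is a simplified model, not any real stack's code: the function name `process_arp_request` and the use of a plain dictionary for the cache are assumptions made for illustration, and only ordinary and gratuitous requests are considered.

```python
def process_arp_request(cache, my_ip, sender_ip, sender_mac, target_ip):
    """Handle one received ARP request (hypothetical helper).

    cache maps IPv4 address strings to hardware address strings.
    Returns True if this host should send an ARP reply.
    """
    addressed_to_us = (target_ip == my_ip)
    # Refresh an existing entry for the sender on ANY request, and learn a
    # new entry when the request is addressed to us. A sender claiming our
    # own address is a conflict, so it is never cached.
    if sender_ip != my_ip and (sender_ip in cache or addressed_to_us):
        cache[sender_ip] = sender_mac
    return addressed_to_us


# A neighbor with a stale entry sees the gratuitous ARP from 10.0.0.56.
# The "old" address below is made up for the example.
neighbor_cache = {"10.0.0.56": "0:0:c0:aa:bb:cc"}
reply = process_arp_request(neighbor_cache, "10.0.0.3",
                            "10.0.0.56", "0:0:c0:6f:2d:40", "10.0.0.56")
```

After the call, the neighbor's entry for 10.0.0.56 holds the new hardware address and no reply is sent. A host that is itself configured as 10.0.0.56 would instead return True, and its reply is what reveals the duplicate address to the sender.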

Although gratuitous ARP provides some indication that multiple stations may 
be attempting to use the same IPv4 address, it really provides no mechanism to 
react to the situation (other than by printing a message that is ideally acted upon by 
a system administrator). To deal with this issue, [RFC5227] describes IPv4 Address 
Conflict Detection (ACD). ACD defines ARP probe and ARP announcement packets.
An ARP probe is an ARP request packet in which the Sender's Protocol (IPv4)
Address field is set to 0. Probes are used to see if a candidate IPv4 address is being 
used by any other systems in the broadcast domain. Setting the Sender's Protocol 
Address field to 0 avoids cache pollution should the candidate IPv4 address already 
be in use by another host, a difference from the way gratuitous ARP works. An 
ARP announcement is identical to an ARP probe, except both the Sender's Protocol 
Address and the Target Protocol Address fields are filled in with the candidate IPv4 
address. It is used to announce the sender's intention to use the candidate IPv4 
address as its own. 
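The distinction between an ordinary request, a probe, and an announcement comes down to two fields, which the following sketch makes concrete. It assumes the standard 28-byte ARP layout for Ethernet and IPv4; the function names and the returned labels are hypothetical and chosen only for this example.

```python
import socket
import struct

# Hardware type, protocol type, hardware size, protocol size, operation,
# sender MAC, sender IPv4, target MAC, target IPv4 (28 bytes in total).
ARP_FORMAT = "!HHBBH6s4s6s4s"
ARP_REQUEST = 1


def build_arp_request(sender_mac, sender_ip, target_ip):
    """Build an Ethernet/IPv4 ARP request payload (helper for the example)."""
    return struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, ARP_REQUEST,
                       sender_mac, socket.inet_aton(sender_ip),
                       b"\x00" * 6, socket.inet_aton(target_ip))


def classify_arp_request(payload):
    """Label an ARP payload as a probe, an announcement, or an ordinary request."""
    fields = struct.unpack(ARP_FORMAT, payload[:28])
    operation, sender_ip, target_ip = fields[4], fields[6], fields[8]
    if operation != ARP_REQUEST:
        return "not a request"
    if sender_ip == b"\x00\x00\x00\x00":
        return "probe"            # sender address 0 avoids cache pollution
    if sender_ip == target_ip:
        return "announcement"     # same shape as a gratuitous ARP
    return "ordinary request"


mac = bytes.fromhex("0000c06f2d40")
probe = build_arp_request(mac, "0.0.0.0", "10.0.0.56")
announcement = build_arp_request(mac, "10.0.0.56", "10.0.0.56")
```

Here `classify_arp_request(probe)` yields "probe" and `classify_arp_request(announcement)` yields "announcement". On the wire an announcement and a gratuitous ARP look the same, so the sketch cannot tell them apart.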

To perform ACD, a host sends an ARP probe when an interface is brought up 
or out of sleep, or when a new link is established (e.g., when an association with 
a new wireless network is made). It first waits a random amount of time (in the
range 0-1s, distributed uniformly) before sending up to three probe packets. The
delay is used to avoid power-on congestion when multiple systems powered on 
simultaneously would otherwise attempt to perform ACD at once, leading to a 
network traffic spike. The probes are spaced randomly, with between 1 and 2s of 
delay (distributed uniformly) placed between. 
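The probe timing can be sketched in a few lines. This is an illustration only: the function name `acd_probe_times` is hypothetical, and the sketch assumes that no conflict is detected, so all three probes are sent.

```python
import random


def acd_probe_times(rng=random, num_probes=3):
    """Return the send times (seconds after link-up) for the ACD probes.

    rng may be the random module or a random.Random instance.
    """
    t = rng.uniform(0.0, 1.0)          # initial random wait of 0-1s
    times = [t]
    for _ in range(num_probes - 1):
        t += rng.uniform(1.0, 2.0)     # 1-2s between consecutive probes
        times.append(t)
    return times


schedule = acd_probe_times(random.Random(2010))
```

Because each host draws its own delays, a group of machines powered on together spreads its probes over time rather than sending them all at once.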

While sending its probes, a requesting station may receive ARP requests or 
replies. A reply to its probe indicates that a different station is already using the 
candidate IP address. A request containing the same candidate IPv4 address in the 
Target Protocol Address field sent from a different system indicates that the other 
system is simultaneously attempting to acquire the candidate IPv4 address. In 
either case, the system should indicate an address conflict message and pursue 
some alternative address. For example, this is the recommended behavior when 
being assigned an address using DHCP (see Chapter 6). [RFC5227] places a limit of 
ten conflicts when trying to acquire an address before the requesting host enters a 
rate-limiting phase when it is permitted to perform ACD only once every 60s until
successful.

If a requesting host does not discover a conflict according to the procedure
just described, it sends two ARP announcements spaced 2s apart to indicate to systems
in the broadcast domain the IPv4 address it is now using. In the announcements,
both the Sender's Protocol Address and the Target Protocol Address fields are
set to the address being claimed. The purpose of sending these announcements is 
to ensure that any preexisting cached address mappings are updated to reflect the 
sender's current use of the address. 

ACD is considered to be an ongoing process, and in this way it differs from 
gratuitous ARP. Once a host has announced an address it is using, it continues 
inspecting incoming ARP traffic (requests and replies) to see if its address appears 
in the Sender’s Protocol Address field. If so, some other system believes it is rightfully 
using the same address. In this case, [RFC5227] provides three possible resolution 
mechanisms: cease using the address, keep the address but send a "defensive" 
ARP announcement and cease using it if the conflict continues, or continue to 
use the address despite the conflict. The last option is recommended only for systems
that truly require a fixed, stable address (e.g., an embedded device such as a
printer or router). 

[RFC5227] also suggests the potential benefit of having some ARP replies be 
sent using link-layer broadcast. Although this has not traditionally been the way 
ARP works, there can be some benefit in doing so, at the expense of requiring all 
stations on the same segment to process all ARP traffic. Broadcast replies allow 
ACD to occur more quickly because all stations will notice the reply and invalidate
their caches during a conflict.


4.9 The arp Command 

We have used the arp command with the -a flag on Windows and Linux to display
all the entries in the ARP cache (on Linux we get similar information without
using -a). The superuser or administrator can specify the -d option to delete an 
entry from the ARP cache. (This was used before running a few of the examples, 
to force an ARP exchange to be performed.) 

Entries can also be added using the -s option. It requires an IPv4 address (or 
host name that can be converted to an IPv4 address using DNS) and an Ethernet 
address. The IPv4 address and the Ethernet address are added to the cache as an 
entry. This entry is made semipermanent (i.e., it does not time out from the cache, 
but it disappears when the system is rebooted). 

The Linux version of arp provides a few more features than the Windows 
version. When the temp keyword is supplied at the end of the command line 
when adding an entry using -s, the entry is considered to be temporary and times 
out in the same way that other ARP entries do. The keyword pub at the end of a 
command line, also used with the -s option, causes the system to act as an ARP 
responder for that entry. The system answers ARP requests for the IPv4 address, 
replying with the specified Ethernet address. If the advertised address is one of
the system's own, the system is acting as a proxy ARP agent (see Section 4.7) for
the specified IPv4 address. If arp -s is used to enable proxy ARP, Linux responds
for the address specified even if the file /proc/sys/net/ipv4/conf/*/proxy_arp
contains 0.


4.10 Using ARP to Set an Embedded Device’s IPv4 Address 

As more embedded devices are made compatible with Ethernet and the TCP/IP 
protocols, it is increasingly common to find network-attached devices that have 
no direct way to enter their network configuration information (e.g., they have no 
keyboard, so entering an IP address for them to use is not possible). These devices 
are typically configured in one of two ways. First, DHCP can be used to automatically
assign an address and other information (see Chapter 6). Another way is to
use ARP to set an IPv4 address, although this method is less common. 

Using ARP to configure an embedded device's IPv4 address was not the original
intent of the protocol, so it is not entirely automatic. The basic idea is to manually
establish an ARP mapping for the device (using the arp -s command), then
send an IP packet to the address. Because the ARP entry is already present, no
ARP request/reply is generated. Instead, the hardware address can be used immediately.
Of course, the Ethernet (MAC) address of the device must be known. It is
typically printed on the device itself and sometimes doubles as the manufacturer's 
device serial number. When the device receives a packet destined for its hardware 
address, whatever destination address is contained in the datagram is used to 
assign its initial IPv4 address. After that, the device can be fully configured using 
other means (e.g., by an embedded Web server). 


4.11 Attacks Involving ARP 

There have been a series of attacks involving ARP. The most straightforward is 
to use the proxy ARP facility to masquerade as some host, responding to ARP 
requests for it. If the victim host is not present, this is straightforward and may not 
be detected. It is considerably more difficult if the host is still running, as more 
than one response may be generated per ARP request, which is easily detected. 

A more subtle attack has been launched against ARP that involves cases where 
a machine is attached to more than one network, and ARP entries from one interface
"leak" over into the ARP table of the other, because of a bug in the ARP software.
This can be exploited to improperly direct traffic onto the wrong network
segment. Linux provides a way to affect this behavior directly, by modifying the
file /proc/sys/net/ipv4/conf/*/arp_filter. If the value 1 is written into
this file, then when an incoming ARP request arrives over an interface, an IP forwarding
check is made. The IP address of the requestor is looked up to determine
which interface would be used to send IP datagrams back to it. If the interface
used by the arriving ARP request is different from the interface that would be
used to return an IP datagram to the requestor, the ARP response is suppressed
(and the triggering ARP request is dropped).

A somewhat more damaging attack on ARP involves the handling of static 
entries. As discussed previously, static entries may be used to avoid the ARP 
request/reply when seeking the Ethernet (MAC) address corresponding to a particular
IP address. Such static entries have been used in an attempt to enhance
security. The idea is that static entries placed in the ARP cache for important hosts 
would soon detect any hosts masquerading with that IP address. Unfortunately, 
most implementations of ARP have traditionally replaced even static cache entries 
with entries provided by ARP replies. The consequence of this is that a machine 
receiving an ARP reply (even if it did not send an ARP request) would be coaxed
into replacing its static entries with those provided by an attacker. 


4.12 Summary 

ARP is a basic protocol in almost every TCP/IP implementation, but it normally 
does its work without the application or user being aware of it. ARP is used to 
determine the hardware addresses corresponding to the IPv4 addresses in use on 
the locally reachable IPv4 subnet. It is invoked when forwarding datagrams destined
for the same subnet as the sending host's and is also used to reach a router
when the destination of a datagram is not on the subnet (the details of this are 
explained in Chapter 5). The ARP cache is fundamental to its operation, and we 
have used the arp command to examine and manipulate the cache. Each entry 
in the cache has a timer that is used to remove both incomplete and completed 
entries. The arp command displays and modifies entries in the ARP cache. 

We followed through the normal operation of ARP along with specialized 
versions: proxy ARP (when a router answers ARP requests for hosts accessible on 
another of the router's interfaces) and gratuitous ARP (sending an ARP request for 
your own IP address, normally when bootstrapping). We also discussed address 
conflict detection for IPv4, which uses a continually operating gratuitous ARP-like 
exchange to avoid address duplication within the same broadcast domain. Finally,
we discussed a number of attacks that involve ARP. Most of these involve impersonating
hosts by fabricating ARP responses for them. This can lead to problems
with higher-layer protocols if they do not implement strong security (see Chapter
18).


5.1 Introduction

IP is the workhorse protocol of the TCP/IP protocol suite. All TCP, UDP, ICMP, and
IGMP data gets transmitted as IP datagrams. IP provides a best-effort, connectionless
datagram delivery service. By "best-effort" we mean there are no guarantees
that an IP datagram gets to its destination successfully. Although IP does not simply
drop all traffic unnecessarily, it provides no guarantees as to the fate of the
packets it attempts to deliver. When something goes wrong, such as a router temporarily
running out of buffers, IP has a simple error-handling algorithm: throw
away some data (usually the last datagram that arrived). Any required reliability
must be provided by the upper layers (e.g., TCP). IPv4 and IPv6 both use this basic
best-effort delivery model.

The term connectionless means that IP does not maintain any connection state
information about related datagrams within the network elements (i.e., within the
routers); each datagram is handled independently from all others. This also
means that IP datagrams can be delivered out of order. If a source sends two consecutive
datagrams (first A, then B) to the same destination, each is routed independently
and can take different paths, and B may arrive before A. Other things
can happen to IP datagrams as well: they may be duplicated in transit, and they
may have their data altered as the result of errors. Again, some protocol above IP
(usually TCP) has to handle all of these potential problems in order to provide an
error-free delivery abstraction for applications.

In this chapter we take a look at the fields in the IPv4 (see Figure 5-1) and
IPv6 (see Figure 5-2) headers and describe how IP forwarding works. The official
specification for IPv4 is given in [RFC0791]. A series of RFCs describe IPv6, starting
with [RFC2460].

Figure 5-1 The IPv4 datagram. The header is of variable size, limited to fifteen 32-bit words (60
bytes) by the 4-bit IHL field. A typical IPv4 header contains 20 bytes (no options). The 
source and destination addresses are 32 bits long. Most of the second 32-bit word is used 
for the IPv4 fragmentation function. A header checksum helps ensure that the fields in 
the header are delivered correctly to the proper destination but does not protect the data. 


[Figure 5-2 diagram: the IPv6 header. Fields in order: Version (4 bits), DS Field (6 bits), ECN (2 bits), Flow Label (20 bits), Payload Length (16 bits), Next Header (8 bits), Hop Limit (8 bits), Source IP Address (128 bits), Destination IP Address (128 bits).]

Figure 5-2 The IPv6 header is of fixed size (40 bytes) and contains 128-bit source and destination
addresses. The Next Header field is used to indicate the presence and types of additional 
extension headers that follow the IPv6 header, forming a daisy chain of headers that may 
include special extensions or processing directives. Application data follows the header 
chain, usually immediately following a transport-layer header.


5.2 IPv4 and IPv6 Headers

Figure 5-1 shows the format of an IPv4 datagram. The normal size of the IPv4 
header is 20 bytes, unless options are present (which is rare). The IPv6 header is 
twice as large but never has any options. It may have extension headers, which provide
similar capabilities, as we shall see later. In our pictures of headers and datagrams,
the most significant bit is numbered 0 at the left, and the least significant
bit of a 32-bit value is numbered 31 on the right. 

The 4 bytes in a 32-bit value are transmitted in the following order: bits 0-7 
first, then bits 8-15, then 16-23, and bits 24-31 last. This is called big endian byte 
ordering, which is the byte ordering required for all binary integers in the TCP/IP 
headers as they traverse a network. It is also called network byte order. Computer 
CPUs that store binary integers in other formats, such as the little endian format 
used by most PCs, must convert the header values into network byte order for 
transmission and back again for reception. 
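As a small illustration of the conversion, the sketch below uses Python's standard `struct` module to serialize the same 32-bit value in network byte order and in little endian order. The function names are hypothetical, and the value chosen is simply the address 10.0.0.56 written as an integer.

```python
import struct


def to_network_order(value):
    """Serialize a 32-bit unsigned integer in network (big endian) byte order."""
    return struct.pack("!I", value)


def from_network_order(data):
    """Recover the integer from 4 bytes received in network byte order."""
    return struct.unpack("!I", data)[0]


address = 0x0A000038                       # 10.0.0.56 as a 32-bit integer
wire_bytes = to_network_order(address)     # bits 0-7 first: 0a 00 00 38
little_endian = struct.pack("<I", address) # typical PC memory layout: 38 00 00 0a
```

The two byte strings contain the same four bytes in opposite order, which is why a little endian host must convert header values before transmission and after reception.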

5.2.1 IP Header Fields 

The first field (only 4 bits or one nibble wide) is the Version field. It contains the 
version number of the IP datagram: 4 for IPv4 and 6 for IPv6. The headers for both 
IPv4 and IPv6 share the location of the Version field but no others. Thus, the two 
protocols are not directly interoperable—a host or router must handle either IPv4 
or IPv6 (or both, called dual stack) separately. Although other versions of IP have 
been proposed and developed, only versions 4 and 6 have any significant amount 
of use. The IANA keeps an official registry of these version numbers [IV].

The Internet Header Length (IHL) field is the number of 32-bit words in the IPv4
header, including any options. Because this is also a 4-bit field, the IPv4 header is 
limited to a maximum of fifteen 32-bit words or 60 bytes. Later we shall see how 
this limitation makes some of the options, such as the Record Route option, nearly 
useless today. The normal value of this field (when no options are present) is 5. 
There is no such field in IPv6 because the header length is fixed at 40 bytes. 
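Since the Version and IHL fields share the first byte of an IPv4 header, extracting them is a matter of shifting and masking. The following sketch shows one way to do it; the function names are hypothetical.

```python
def ip_version(first_byte):
    """The Version field is the upper 4 bits of the first header byte."""
    return first_byte >> 4


def ipv4_header_length(first_byte):
    """Return the IPv4 header length in bytes, taken from the 4-bit IHL field."""
    if ip_version(first_byte) != 4:
        raise ValueError("the IHL field exists only in IPv4")
    ihl_words = first_byte & 0x0F          # lower 4 bits, counted in 32-bit words
    if ihl_words < 5:
        raise ValueError("IHL below 5 cannot cover the 20-byte basic header")
    return ihl_words * 4


# 0x45 is the usual first byte: version 4, IHL 5, so a 20-byte header.
typical_length = ipv4_header_length(0x45)
# 0x4F carries the largest possible IHL (15 words, or 60 bytes).
maximum_length = ipv4_header_length(0x4F)
```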

Following the header length, the original specification of IPv4 [RFC0791] 
specified a Type of Service (ToS) byte, and IPv6 [RFC2460] specified the equivalent 
Traffic Class byte. Use of these never became widespread, so eventually this 8-bit 
field was split into two smaller parts and redefined by a set of RFCs ([RFC3260] 
[RFC3168][RFC2474] and others). The first 6 bits are now called the Differentiated 
Services Field (DS Field), and the last 2 bits are the Explicit Congestion Notification
(ECN) field or indicator bits. These RFCs now apply to both IPv4 and IPv6. These 
fields are used for special processing of the datagram when it is forwarded. We 
discuss them in more detail in Section 5.2.3. 
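The split of the former ToS or Traffic Class byte can be written down directly. In this sketch the function names are hypothetical. The IPv6 helper assumes the header layout of Figure 5-2, in which the 8-bit byte sits between the 4-bit Version field and the 20-bit Flow Label in the first 32-bit word.

```python
def split_ds_ecn(byte_value):
    """Split the former ToS/Traffic Class byte into (DS Field, ECN)."""
    ds_field = byte_value >> 2      # first (most significant) 6 bits
    ecn = byte_value & 0x03         # last 2 bits
    return ds_field, ecn


def ipv6_traffic_class(first_word):
    """Extract the 8-bit DS Field + ECN byte from the first 32-bit IPv6 word."""
    return (first_word >> 20) & 0xFF


# 0xB8 is an arbitrary example value: DS Field 46 with ECN 0.
ds_field, ecn = split_ds_ecn(0xB8)
```

For IPv4 the byte is simply the second byte of the header, so no extra shifting is needed before calling `split_ds_ecn`.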

The Total Length field is the total length of the IPv4 datagram in bytes. Using 
this field and the IHL field, we know where the data portion of the datagram 
starts, and its length. Because this is a 16-bit field, the maximum size of an IPv4 
datagram (including header) is 65,535 bytes. The Total Length field is required in
the header because some lower-layer protocols that carry IPv4 datagrams do not
(accurately) convey the size of encapsulated datagrams on their own. Ethernet,
for example, pads small frames to be a minimum length (64 bytes). Even though
the minimum Ethernet payload size is 46 bytes (see Chapter 3), an IPv4 datagram
can be smaller (as few as 20 bytes). If the Total Length field were not provided, the
IPv4 implementation would not know how much of a 46-byte Ethernet payload was
really an IP datagram, as opposed to padding, leading to possible confusion.
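A receiver can use the Total Length field to discard link-layer padding, as the following sketch suggests. The function name `strip_ethernet_padding` is hypothetical, and the sketch assumes the input is the Ethernet payload of a frame carrying IPv4.

```python
import struct


def strip_ethernet_padding(eth_payload):
    """Return only the IPv4 datagram, dropping any trailing Ethernet padding."""
    if len(eth_payload) < 20:
        raise ValueError("too short to hold an IPv4 header")
    # Total Length is the 16-bit field in bytes 2-3 of the IPv4 header.
    total_length = struct.unpack("!H", eth_payload[2:4])[0]
    if total_length < 20 or total_length > len(eth_payload):
        raise ValueError("Total Length is inconsistent with the payload size")
    return eth_payload[:total_length]


# A 28-byte datagram (20-byte header + 8 data bytes) padded to 46 bytes.
# Only the first four header bytes matter here; the rest are left as zeros.
header = bytes([0x45, 0x00]) + (28).to_bytes(2, "big") + bytes(16)
datagram = header + b"ABCDEFGH"
padded_payload = datagram + bytes(46 - len(datagram))
recovered = strip_ethernet_padding(padded_payload)
```

In this example `recovered` is the original 28-byte datagram, with the 18 bytes of padding removed.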

Although it is possible to send a 65,535-byte IP datagram, most link layers
(such as Ethernet) are not able to carry one this large without fragmenting it
(chopping it up) into smaller pieces. Furthermore, a host is not required to be able
to receive an IPv4 datagram larger than 576 bytes. (In IPv6 a host must be able to
process a datagram at least as large as the MTU of the link to which it is attached,
and the minimum link MTU is 1280 bytes.) Many applications that use the UDP
protocol (see Chapter 10) for data transport (e.g., DNS, DHCP, etc.) use a limited
data size of 512 bytes to avoid the 576-byte IPv4 limit. TCP chooses its own datagram
size based on additional information (see Chapter 15).

When an IPv4 datagram is fragmented into multiple smaller fragments, each of
which itself is an independent IP datagram, the Total Length field reflects the length
of the particular fragment. Fragmentation is described in detail along with UDP in
Chapter 10. In IPv6, fragmentation is not supported by the header, and the length
is instead given by the Payload Length field. This field measures the length of the
IPv6 datagram not including the length of the header; extension headers, however,
are included in the Payload Length field. As with IPv4, the 16-bit size of the field
limits its maximum value to 65,535. With IPv6, however, it is the payload length that
is limited to 64KB, not the entire datagram. In addition, IPv6 supports a jumbogram
option (see Section 5.3.1.2) that provides for the possibility, at least theoretically, of
single packets with payloads as large as 4GB (4,294,967,295 bytes)!

The Identification field helps identify each datagram sent by an IPv4 host. To
ensure that the fragments of one datagram are not confused with those of another,
the sending host normally increments an internal counter by 1 each time a datagram
is sent (from one of its IP addresses) and copies the value of the counter into the IPv4
Identification field. This field is most important for implementing fragmentation, so
we explore it further in Chapter 10, where we also discuss the Flags and Fragment
Offset fields. In IPv6, this field shows up in the Fragmentation extension header, as
we discuss in Section 5.3.3.

The Time-to-Live field, or TTL, sets an upper limit on the number of routers
through which a datagram can pass. It is initialized by the sender to some value
(64 is recommended [RFC1122], although 128 or 255 is not uncommon) and decremented
by 1 by every router that forwards the datagram. When this field reaches
0, the datagram is thrown away, and the sender is notified with an ICMP message
(see Chapter 8). This prevents packets from getting caught in the network forever
should an unwanted routing loop occur.


Note

The TTL field was originally specified to be the maximum lifetime of an IP datagram
in seconds, but routers were also always required to decrement the value by
at least 1. Because virtually no routers today hold on to a datagram longer than 1s
under normal operation, the earlier rule is now ignored or forgotten, and in IPv6 
the field has been renamed to its de facto use: Hop Limit. 


The Protocol field in the IPv4 header contains a number indicating the type of 
data found in the payload portion of the datagram. The most common values are 
17 (for UDP) and 6 (for TCP). This provides a demultiplexing feature so that the IP 
protocol can be used to carry payloads of more than one protocol type. Although 
this field originally specified the transport-layer protocol the datagram is encapsulating,
it is now understood to identify the encapsulated protocol, which may or
may not be a transport protocol. For example, other encapsulations are possible, such
as IPv4-in-IPv4 (value 4). The official list of the possible values of the Protocol field 
is given in the assigned numbers page [AN]. The Next Header field in the IPv6 
header generalizes the Protocol field from IPv4. It is used to indicate the type of 
header following the IPv6 header. This field may contain any values defined for 
the IPv4 Protocol field, or any of the values associated with the IPv6 extension 
headers described in Section 5.3. 
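The demultiplexing role of the Protocol field can be sketched as a table lookup. The values 6, 17, and 4 are those given above; 1 (ICMP) and 2 (IGMP) are added here from the standard assigned-numbers list. The table and function names are hypothetical, and a real implementation would dispatch to handler code rather than return a string.

```python
# A small excerpt of the assigned protocol numbers.
PROTOCOL_NAMES = {
    1: "ICMP",
    2: "IGMP",
    4: "IPv4-in-IPv4",
    6: "TCP",
    17: "UDP",
}


def payload_protocol(ipv4_header):
    """Name the protocol carried in an IPv4 datagram (hypothetical helper)."""
    number = ipv4_header[9]        # the Protocol field is the 10th header byte
    return PROTOCOL_NAMES.get(number, "unknown ({})".format(number))


# A 20-byte header with every field zeroed except Version/IHL and Protocol.
example_header = bytes([0x45]) + bytes(8) + bytes([17]) + bytes(10)
carried = payload_protocol(example_header)   # "UDP"
```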

The Header Checksum field is calculated over the IPv4 header only. This is important
to understand because it means that the payload of the IPv4 datagram (e.g.,
TCP or UDP data) is not checked for correctness by the IP protocol. To help ensure 
that the payload portion of an IP datagram has been correctly delivered, other 
protocols must cover any important data that follows the header with their own 
data-integrity-checking mechanisms. We shall see that almost all protocols encapsulated
in IP (ICMP, IGMP, UDP, and TCP) have a checksum in their own headers
to cover their header and data and also to cover certain parts of the IP header they 
deem important (a form of "layering violation"). Perhaps surprisingly, the IPv6 
header does not have any checksum field. 


Note 

Omitting the checksum field from the IPv6 header was a somewhat controversial 
decision. The reasoning behind this action is roughly as follows: Higher-layer protocols
requiring correctness in the IP header are required to compute their own
checksums over the data they believe to be important. A consequence of errors 
in the IP header is that the data is delivered to the wrong destination, is indicated 
to have come from the wrong source, or is otherwise mangled during delivery. 
Because bit errors are relatively rare (thanks to fiber-optic delivery of Internet 
traffic) and stronger mechanisms are available to ensure correctness of the other 
fields (higher-layer checksums or other checks), it was decided to eliminate the 
field from the IPv6 header.

The algorithm used in computing a checksum is also used by most of the
other Internet-related protocols that use checksums and is sometimes known as
the Internet checksum. Note that when an IPv4 datagram passes through a router,
its header checksum must change as a result of decrementing the TTL field. We
discuss the methods for computing the checksum in more detail in Section 5.2.2.

Every IP datagram contains the Source IP Address of the sender of the datagram
and the Destination IP Address of where the datagram is destined. These are 32-bit
values for IPv4 and 128-bit values for IPv6, and they usually identify a single interface
on a computer, although multicast and broadcast addresses (see Chapter 2)
violate this rule. While a 32-bit address can accommodate a seemingly large number
of Internet entities (about 4.3 billion), there is widespread agreement that this number
is inadequate, a primary motivation for moving to IPv6. The 128-bit address
of IPv6 can accommodate a huge number of Internet entities. As was restated in
[H05], IPv6 has 3.4 x 10^38 (340 undecillion) addresses. Quoting from [H05] and others:
"The optimistic estimate would allow for 3,911,873,538,269,506,102 addresses
per square meter of the surface of the planet Earth." It certainly seems as if this
should last a very, very long time indeed.

5.2.2 The Internet Checksum 

The Internet checksum is a 16-bit mathematical sum used to determine, with 
reasonably high probability, whether a received message or portion of a message 
matches the one sent. Note that the Internet checksum algorithm is not the same as 
the common cyclic redundancy check (CRC) [PB61], which offers stronger protection. 

To compute the IPv4 header checksum for an outgoing datagram, the value 
of the datagram's Checksum field is first set to 0. Then, the 16-bit one's complement 
sum of the header is calculated (the entire header is considered a sequence 
of 16-bit words). The 16-bit one's complement of this sum is then stored in the 
Checksum field to make the datagram ready for transmission. One's complement 
addition can be implemented by "end-round-carry addition": when a carry bit 
is produced using conventional (two's complement) addition, the carry is added 
back in as a 1 value. Figure 5-3 presents an example, where the message contents 
are represented in hexadecimal. 

When an IPv4 datagram is received, a checksum is computed across the whole 
header, including the value of the Checksum field itself. Assuming there are no 
errors, the computed checksum value is always 0 (a one's complement of the value 
FFFF). Note that for any nontrivial packet or header, the value of the Checksum 
field in the packet can never be FFFF. If it were, the sum (prior to the final one's 
complement operation at the sender) would have to have been 0. No sum can ever 
be 0 using one's complement addition unless all the bytes are 0, something that 
never happens with any legitimate IPv4 header. When the header is found to be 
bad (the computed checksum is nonzero), the IPv4 implementation discards the 
received datagram. No error message is generated. It is up to the higher layers to 
somehow detect the missing datagram and retransmit if necessary. 

Sending 

Message:                E3 4F 23 96 44 27 99 F3 [00 00]    (Checksum Field = 0000) 
Two's Complement Sum:   1E4FF 
One's Complement Sum:   E4FF + 1 = E500 
One's Complement:       ~(E500) = ~(1110 0101 0000 0000) = 0001 1010 1111 1111 = 1AFF (the checksum) 

Receiving 

Message + Checksum = E34F + 2396 + 4427 + 99F3 + 1AFF = E500 + 1AFF = FFFF 
~(Message + Checksum) = 0000 

Figure 5-3 The Internet checksum is the one's complement of a one's complement 16-bit sum of the 
data being checksummed (zero padding is used if the number of bytes being summed is 
odd). If the data being summed includes a Checksum field, the field is first set to 0 prior 
to the checksum operation and then filled in with the computed checksum. To check 
whether an incoming block of data that contains a Checksum field (header, payload, etc.) 
is valid, the same type of checksum is computed over the whole block (including the 
Checksum field). Because the Checksum field is essentially the inverse of the checksum of 
the rest of the data, computing the checksum on correctly received data should produce 
a value of 0. 
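
The procedure illustrated in Figure 5-3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a production implementation: the function name internet_checksum is hypothetical, and the data is treated as big-endian 16-bit words with zero padding for an odd length, as the caption describes.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit Internet checksum of data (illustrative sketch)."""
    if len(data) % 2:                  # zero-pad if the byte count is odd
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]    # next big-endian 16-bit word
    while total >> 16:                 # end-round carry: add carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF             # one's complement of the sum


# The message from Figure 5-3, with the Checksum field initially set to 0000.
message = bytes.fromhex("E34F2396442799F3")
checksum = internet_checksum(message + b"\x00\x00")
received = message + checksum.to_bytes(2, "big")

print(f"{checksum:04X}")                     # 1AFF
print(f"{internet_checksum(received):04X}")  # 0000
```

The second call mirrors the receiver: summing the block together with its Checksum field and complementing the result gives 0 when no errors are present.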


5.2.2.1 Mathematics of the Internet Checksum 

For the mathematically inclined, the set of 16-bit hexadecimal values V = {0001, 
. . ., FFFF} and the one's complement sum operation + together form an Abelian 
group. For the combination of a set and an operator to be a group, several 
properties need to be obeyed: closure, associativity, existence of an identity element, and 
existence of inverses. To be an Abelian (commutative) group, commutativity must 
also be obeyed. If we look closely, we see that all of these properties are indeed 
obeyed: 

• For any X,Y in V, (X + Y) is in V [closure] 

• For any X,Y,Z in V, X + (Y + Z) = (X + Y) + Z [associativity] 

• For any X in V, e + X = X + e = X where e = FFFF [identity] 

• For any X in V, there is an X' in V such that X + X' = e [inverse] 

• For any X,Y in V, (X + Y) = (Y + X) [commutativity] 

What is interesting about the set V and the group <V,+> is that we have deleted 
the number 0000 from consideration. If we put the number 0000 in the set V, then 
<V,+> is not a group any longer. To see this, we first observe that 0000 and FFFF 
appear to perform the role of zero (additive identity) using the + operation. For 
example, AB12 + 0000 = AB12 = AB12 + FFFF. However, in a group there can be 
only one identity element. If we have some element 12AB, and assume the identity 
element is 0000, then we need some inverse X' so that (12AB + X') = 0000, but we 
see that no such value of X' exists in V that satisfies the criteria. Therefore, we need 
to exclude 0000 from consideration as the identity element in <V,+> by removing it 
from the set V to make this structure a true group. For an introduction to abstract 
algebra, the reader may wish to consult a detailed text on the subject, such as the 
popular book by Pinter [P90]. 
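
The identity and inverse properties are cheap to check exhaustively, because they involve only one free variable. The following Python sketch does so; the names ones_add and inverse are hypothetical helpers introduced for this illustration, and the closure check covers only one sample row rather than all pairs.

```python
MASK = 0xFFFF
IDENTITY = 0xFFFF          # e = FFFF in the text


def ones_add(x: int, y: int) -> int:
    """One's complement (end-round-carry) sum of two 16-bit values."""
    s = x + y
    return (s & MASK) + (s >> 16)


def inverse(x: int) -> int:
    """Inverse of x within V = {0001, ..., FFFF}; FFFF is its own inverse."""
    return IDENTITY if x == IDENTITY else (~x & MASK)


V = range(0x0001, 0x10000)

# Identity: X + e = X for every X in V.
assert all(ones_add(x, IDENTITY) == x for x in V)

# Inverse: X + X' = e for every X in V, and X' is itself in V.
assert all(inverse(x) in V and ones_add(x, inverse(x)) == IDENTITY for x in V)

# Closure (sample row only): 12AB + Y is never 0000 for Y in V.
assert all(ones_add(0x12AB, y) != 0x0000 for y in V)

print(f"{ones_add(0xAB12, 0x0000):04X} {ones_add(0xAB12, 0xFFFF):04X}")  # AB12 AB12
```

The final line reproduces the observation in the text that both 0000 and FFFF leave AB12 unchanged, which is why 0000 must be excluded from V.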

5.2.3 DS Field and ECN (Formerly Called the ToS Byte or IPv6 Traffic Class) 

The third and fourth fields of the IPv4 header (second and third fields of the IPv6 
header) are the Differentiated Services (called DS Field) and ECN fields. 
Differentiated Services (called DiffServ) is a framework and set of standards aimed at 
supporting differentiated classes of service (i.e., beyond just best-effort) on the Internet 
[RFC2474][RFC2475][RFC3260]. IP datagrams that are marked in certain ways (by 
having some of these bits set according to predefined patterns) may be forwarded 
differently (e.g., with higher priority) than other datagrams. Doing so can lead 
to increased or decreased queuing delay in the network and other special effects 
(possibly with associated special fees imposed by an ISP). A number is placed in 
the DS Field termed the Differentiated Services Code Point (DSCP). A "code point" 
refers to a particular predefined arrangement of bits with agreed-upon meaning. 
Typically, a datagram is assigned a DSCP when it is given to the network 
infrastructure, and that value remains unmodified during delivery. However, policies 
(such as how many high-priority packets are allowed to be sent in a period of 
time) may cause a DSCP in a datagram to be changed during delivery. 

The pair of ECN bits in the header is used for marking a datagram with a 
congestion indicator when passing through a router that has a significant amount of 
internally queued traffic. Both bits are set by persistently congested ECN-aware 
routers when forwarding packets. The use case envisioned for this function is 
that when a marked packet is received at the destination, some protocol (such as 
TCP) will notice that the packet is marked and indicate this fact back to the sender, 
which would then slow down, thereby easing congestion before a router is forced 
to drop traffic because of overload. This mechanism is one of several aimed at 
avoiding or dealing with network congestion, which we explore in more detail in 
Chapter 16. Although the DS Field and ECN field are not obviously closely related, 
the space for them was carved out of the previously defined IPv4 Type of Service 
and IPv6 Traffic Class fields. For this reason, they are often discussed together, and 
the terms "ToS byte" and "Traffic Class byte" are still in widespread use. 

Although the original uses for the ToS and Traffic Class bytes are not widely 
supported, the structure of the DS Field has been arranged to provide some 
backward compatibility with them. To get a clear understanding of how this has been 
accomplished, we first review the original structure of the Type of Service field 
[RFC0791] as shown in Figure 5-4. 

Figure 5-4 The original IPv4 Type of Service and IPv6 Traffic Class field structures. The Precedence 
subfield was used to indicate which packets should receive higher priority (larger values 
mean higher priority). The D, T, and R subfields refer to delay, throughput, and 
reliability. A value of 1 in these fields corresponds to a desire for low delay, high throughput, 
and high reliability, respectively. 


The D, T, and R subfields are for indicating that the datagram should receive 
good treatment with respect to delay, throughput, and reliability. A value of 1 
indicates better treatment (low delay, high throughput, high reliability, respectively). 
The precedence values range from 000 (routine) to 111 (network control) with 
increasing priority (see Table 5-1). They are based on a call preemption scheme 
called Multilevel Precedence and Preemption (MLPP) dating back to the U.S. 
Department of Defense's AUTOVON telephone system [A92], in which lower-precedence 
calls could be preempted by higher-precedence calls. These terms are still in use 
and are being incorporated into VoIP systems. 


Table 5-1 The original IPv4 Type of Service and IPv6 Traffic Class precedence subfield values 


Value    Precedence Name 
000      Routine 
001      Priority 
010      Immediate 
011      Flash 
100      Flash Override 
101      Critical 
110      Internetwork Control 
111      Network Control 
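
To make the original field layout concrete, the following Python sketch decodes a ToS byte into its Precedence, D, T, and R subfields. It assumes the bit positions given in [RFC0791] (Precedence in the 3 high-order bits, followed by D, T, and R, with the 2 low-order bits unused); the function name decode_tos and the dictionary keys are hypothetical.

```python
PRECEDENCE_NAMES = [
    "Routine", "Priority", "Immediate", "Flash",
    "Flash Override", "Critical", "Internetwork Control", "Network Control",
]


def decode_tos(tos: int) -> dict:
    """Split an original-style ToS byte into its subfields (sketch)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS is a single byte")
    return {
        "precedence": PRECEDENCE_NAMES[tos >> 5],   # 3 high-order bits
        "low_delay": bool(tos & 0x10),              # D bit
        "high_throughput": bool(tos & 0x08),        # T bit
        "high_reliability": bool(tos & 0x04),       # R bit
    }


print(decode_tos(0b10110000))
# {'precedence': 'Critical', 'low_delay': True,
#  'high_throughput': False, 'high_reliability': False}
```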


In defining the DS Field, the precedence values have been taken into account 
[RFC2474] so as to provide a limited form of backward compatibility. Referring to 
Figure 5-5, the 6-bit DS Field holds the DSCP, providing support for 64 distinct 
code points. The particular value of the DSCP tells a router the forwarding 
treatment or special handling the datagram should receive. The various forwarding 
treatments are expressed as per-hop behavior (PHB), so the DSCP value effectively 
tells a router which PHB to apply to the datagram. The default value for the DSCP 
is generally 0, which corresponds to routine, best-effort Internet traffic. The 64 
possible DSCP values are broadly divided into a set of pools for various uses, as 
given in [DSCPREG] and shown in Table 5-2. 

Figure 5-5 The DS Field contains the DSCP in 6 bits (5 bits are currently standardized to indicate 
the forwarding treatment the datagram should receive when forwarded by a compliant 
router). The following 2 bits are used for ECN and may be turned on in the datagram 
when it passes through a persistently congested router. When such datagrams arrive 
at their destinations, the congestion indication is sent back to the source in a later 
datagram to inform the source that its datagrams are passing through one or more congested 
routers. 


Table 5-2 The DSCP values are divided into three pools: standardized, experimental/local use 
(EXP/LU), and experimental/local use that is eventually intended for standardization (*). 


Pool    Code Point Prefix    Policy 
1       xxxxx0               Standards 
2       xxxx11               EXP/LU 
3       xxxx01               EXP/LU(*) 


The arrangement provides for some experimentation and local use by 
researchers and operators. DSCPs ending in 0 are subject to standardized use, 
and those ending in 1 are for experimental/local use (EXP/LU). Those ending in 
01 are intended initially for experimentation or local use but with eventual intent 
toward standardization. 
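
The pool assignment depends only on the low-order bits of the DSCP, so it can be expressed as a short test. The following Python sketch follows the prefixes of Table 5-2; the function name dscp_pool and the returned strings are hypothetical.

```python
def dscp_pool(dscp: int) -> str:
    """Classify a 6-bit DSCP into one of the three pools of Table 5-2."""
    if not 0 <= dscp <= 0b111111:
        raise ValueError("DSCP is a 6-bit value")
    if (dscp & 0b1) == 0:          # xxxxx0
        return "Pool 1: Standards"
    if (dscp & 0b11) == 0b11:      # xxxx11
        return "Pool 2: EXP/LU"
    return "Pool 3: EXP/LU(*)"     # xxxx01


print(dscp_pool(0b101110))   # Pool 1: Standards
print(dscp_pool(0b000011))   # Pool 2: EXP/LU
print(dscp_pool(0b000001))   # Pool 3: EXP/LU(*)
```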

Referring to Figure 5-5, the class portion of the DS Field contains the first 3 bits 
and is based on the earlier definition of the Precedence subfield of the Type of Service 
field. Generally, a router is to first segregate traffic into different classes. Traffic 
within a common class may have different drop probabilities, allowing the router 
to decide what traffic to drop first if it is forced to discard traffic. The 3-bit class 
selector provides for eight defined code points (called the class selector code points) 
that correspond to PHBs with a specified minimum set of features providing 
similar functionality to the earlier IP precedence capability. These are called class 
selector compliant PHBs. They are intended to support partial backward compatibility 
with the original definition given for the IP Precedence subfield given in [RFC0791]. 
Code points of the form xxx000 always map to such PHBs, although other values 
may also map to the same PHBs. 

Table 5-3 indicates the class selector DSCP values with their corresponding 
terms for the IP Precedence field from [RFC0791]. The Assured Forwarding (AF) 
group provides forwarding of IP packets in a fixed number of independent AF 
classes, effectively generalizing the precedence concept. Traffic from one class 
is forwarded separately from other classes. Within a traffic class, a datagram is 
assigned a drop precedence. When a router is forced to discard traffic, datagrams 
of higher drop precedence in a class are discarded in preference to those with 
lower drop precedence in the same class [RFC2597]. Combining the traffic class and drop 
precedence, the name AFij corresponds to assured forwarding class i with drop 
precedence j. For example, a datagram marked with AF32 is in traffic class 3 with 
drop precedence 2. 


Table 5-3 The DS Field values are designed to be somewhat compatible with the IP Precedence 
subfield specified for the Type of Service and IPv6 Traffic Class field. AF and EF provide 
enhanced services beyond simple best-effort. 


Name           Value     Reference    Description 
CS0            000000    [RFC2474]    Class selector (best-effort/routine) 
CS1            001000    [RFC2474]    Class selector (priority) 
CS2            010000    [RFC2474]    Class selector (immediate) 
CS3            011000    [RFC2474]    Class selector (flash) 
CS4            100000    [RFC2474]    Class selector (flash override) 
CS5            101000    [RFC2474]    Class selector (CRITIC/ECP) 
CS6            110000    [RFC2474]    Class selector (internetwork control) 
CS7            111000    [RFC2474]    Class selector (control) 
AF11           001010    [RFC2597]    Assured Forwarding (class 1, dp 1) 
AF12           001100    [RFC2597]    Assured Forwarding (1,2) 
AF13           001110    [RFC2597]    Assured Forwarding (1,3) 
AF21           010010    [RFC2597]    Assured Forwarding (2,1) 
AF22           010100    [RFC2597]    Assured Forwarding (2,2) 
AF23           010110    [RFC2597]    Assured Forwarding (2,3) 
AF31           011010    [RFC2597]    Assured Forwarding (3,1) 
AF32           011100    [RFC2597]    Assured Forwarding (3,2) 
AF33           011110    [RFC2597]    Assured Forwarding (3,3) 
AF41           100010    [RFC2597]    Assured Forwarding (4,1) 
AF42           100100    [RFC2597]    Assured Forwarding (4,2) 
AF43           100110    [RFC2597]    Assured Forwarding (4,3) 
EF PHB         101110    [RFC3246]    Expedited Forwarding 
VOICE-ADMIT    101100    [RFC5865]    Capacity-Admitted Traffic 
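
The class selector and AF code points follow a regular bit pattern, which the following Python sketch makes explicit. It assumes the layout used by [RFC2474] and [RFC2597]: the class occupies the 3 high-order bits of the DSCP, the AF drop precedence occupies the next 2 bits, and the last bit is 0 (consistent with AF11 = 001010 in Table 5-3). It also assumes the DSCP sits in the 6 high-order bits of the ToS/Traffic Class byte with ECN in the 2 low-order bits. All function names are hypothetical.

```python
def class_selector(n: int) -> int:
    """DSCP for class selector CSn (form xxx000)."""
    if not 0 <= n <= 7:
        raise ValueError("class selector is 0..7")
    return n << 3


def assured_forwarding(i: int, j: int) -> int:
    """DSCP for AFij: class i (1..4), drop precedence j (1..3)."""
    if not (1 <= i <= 4 and 1 <= j <= 3):
        raise ValueError("AF class is 1..4, drop precedence is 1..3")
    return (i << 3) | (j << 1)


def traffic_class_byte(dscp: int, ecn: int = 0) -> int:
    """Combine a 6-bit DSCP and 2-bit ECN value into one header byte."""
    return ((dscp & 0x3F) << 2) | (ecn & 0x03)


print(f"{assured_forwarding(1, 1):06b}")    # 001010
print(f"{assured_forwarding(3, 2):06b}")    # 011100
print(f"{class_selector(5):06b}")           # 101000
print(f"{traffic_class_byte(0b101110):#04x}")   # 0xb8 (EF, ECN bits clear)
```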


The Expedited Forwarding (EF) service provides the appearance of an 
uncongested network; that is, EF traffic should receive relatively low delay, jitter, and 
loss. Intuitively, this requires the rate of EF traffic going out of a router to be at 
least as large as the rate coming in. Consequently, EF traffic will only ever have to 
wait in a router queue behind other EF traffic. 

Delivering differentiated services in the Internet has been an ongoing effort 
for over a decade. Although much of the standardization effort in terms of 
mechanisms took place in the late 1990s, only in the twenty-first century are some of its 
capabilities being realized and implemented. Some guidance on how to configure 
systems to take advantage of these capabilities is given in [RFC4594]. The 
complexity of differentiated services is due, in part, to the linkage between 
differentiated services and the presumed differentiated pricing structure and consequent 
issues of fairness that would go along with it. Such economic relationships can be 
complex and are outside the scope of the present discussion. For more information 
on this and related topics, please see [MB97] and [W03]. 

5.2.4 IP Options 

IP supports a number of options that may be selected on a per-datagram basis. 
Most of these options were introduced in [RFC0791] at the time IPv4 was being 
designed, when the Internet was considerably smaller and when threats from 
malicious users were less of a concern. As a consequence, many of the options are 
no longer practical or desirable because of the limited size of the IPv4 header or 
concerns regarding security. With IPv6, most of the options have been removed 
or altered and are not an integral part of the basic IPv6 header. Instead, they are 
placed after the IPv6 header in one or more extension headers. An IP router that 
receives a datagram containing options is usually supposed to perform special 
processing on the datagram. In some cases IPv6 routers process extension headers, 
but many headers are designed to be processed only by end hosts. In some routers, 
datagrams with options or extensions are not forwarded as fast as ordinary 
datagrams. We briefly discuss the IPv4 options as background and then look at how 
IPv6 implements extension headers and options. Table 5-4 shows most of the IPv4 
options that have been standardized over the years. 

Table 5-4 gives the reserved IPv4 options for which descriptive RFCs can be 
found. The complete list is periodically updated and is available online 
[IPPARAM]. The options area always ends on a 32-bit boundary. Pad bytes with a value 
of 0 are added if necessary. This ensures that the IPv4 header is always a multiple 
of 32 bits (as required by the IHL field). The "Number" column in Table 5-4 is the 
number of the option. The "Value" column indicates the number placed inside the 
option Type field to indicate the presence of the option. These values from the two 
columns are not necessarily the same because the Type field has additional 
structure. In particular, the first (high-order) bit indicates whether the option should 
be copied into fragments if the associated datagram is fragmented. The next 2 bits 
indicate the option's class. Currently, all options in Table 5-4 use option class 0 
(control) except Timestamp and Traceroute, which are both class 2 (debugging and 
measurement). Classes 1 and 3 are reserved. 
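
The relationship between the "Number" and "Value" columns can be sketched in Python by splitting the option Type byte into its Copy, Class, and Number subfields. The function name decode_ipv4_option_type is hypothetical; the sample values are taken from Table 5-4.

```python
def decode_ipv4_option_type(value: int) -> tuple:
    """Split an IPv4 option Type byte into (copy, class, number)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("option Type is a single byte")
    copy = value >> 7                    # 1 bit: copy into fragments?
    option_class = (value >> 5) & 0b11   # 2 bits: 0 = control, 2 = debugging
    number = value & 0b11111             # 5 bits: option number
    return copy, option_class, number


print(decode_ipv4_option_type(131))   # (1, 0, 3)   loose source routing
print(decode_ipv4_option_type(68))    # (0, 2, 4)   Timestamp
print(decode_ipv4_option_type(82))    # (0, 2, 18)  Traceroute
print(decode_ipv4_option_type(148))   # (1, 0, 20)  Router Alert
```

Timestamp and Traceroute decode to class 2, as the text notes, while the source routing and Router Alert options have the Copy bit set.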

Most of the standardized options are rarely or never used in the Internet today. 
Options such as Source and Record Route, for example, require IPv4 addresses to 
be placed inside the IPv4 header. Because there is only limited space in the header 
(60 bytes total, of which 20 are devoted to the basic IPv4 header), these options are 
not very useful in today's IPv4 Internet where the number of router hops in an 
average Internet path is about 15 [LFS07]. In addition, the options are primarily 
for diagnostic purposes and make the construction of firewalls more cumbersome 
and risky. Thus, IPv4 options are typically disallowed or stripped at the perimeter 
of enterprise networks by firewalls (see Chapter 7). 

Table 5-4 Options, if present, are carried in IPv4 packets immediately after the basic IPv4 header. Options 
are identified by an 8-bit option Type field. This field is subdivided into three subfields: Copy (1 bit), 
Class (2 bits), and Number (5 bits). Options 0 and 1 are a single byte long, and most others are variable 
in length. Variable options consist of 1 byte of type identifier, 1 byte of length, and the option itself. 

Name               Number  Value  Length    Description                                     Reference           Comments 
End of List        0       0      1         Indicates no more options.                      [RFC0791]           If required 
No Op              1       1      1         Indicates no operation to perform (used for     [RFC0791]           If required 
                                            padding). 
Source Routing     3       131    Variable  Sender lists router "waypoints" for packet to   [RFC0791]           Rare, often filtered 
                   9       137              traverse when forwarded. Loose means other 
                                            routers can be included between waypoints 
                                            (3,131). Strict means all waypoints have to be 
                                            traversed exactly in order (9,137). 
Security and       2       130    11        Specifies how to include security labels and    [RFC1108]           Historic 
Handling Labels    5       133              handling restrictions with IP datagrams in 
                                            U.S. military environments. 
Record Route       7       7      Variable  Records the route taken by a packet in its      [RFC0791]           Rare 
                                            header. 
Timestamp          4       68     Variable  Records the time of day at a packet's source    [RFC0791]           Rare 
                                            and destination. 
Stream ID          8       136    4         Carries the 16-bit SATNET stream identifier.    [RFC0791]           Historic 
EIP                17      145    Variable  Extended Internet Protocol (an experiment in    [RFC1385]           Historic 
                                            the early 1990s). 
Traceroute         18      82     Variable  Adds a route-tracing option and ICMP message    [RFC1393]           Historic 
                                            (an experiment in the early 1990s). 
Router Alert       20      148    4         Indicates that a router needs to interpret the  [RFC2113][RFC5350]  Occasional 
                                            contents of the datagram. 
Quick-Start        25      25     8         Indicates fast transport protocol start         [RFC4782]           Rare 
                                            (experimental). 

Within enterprise networks, where the average path length is smaller and 
protection from malicious users may be less of a concern, options can still be useful. 
In addition, the Router Alert option represents somewhat of an exception to the 
problems with the other options for use on the Internet. Because it is designed 
primarily as a performance optimization and does not change fundamental router 
behavior, it is permitted more often than the other options. As suggested 
previously, some router implementations have a highly optimized internal pathway for 
forwarding IP traffic containing no options. The Router Alert option informs 
routers that a packet requires processing beyond the conventional forwarding 
algorithms. The experimental Quick-Start option at the end of the table is applicable to 
both IPv4 and IPv6, and we describe it in the next section when discussing IPv6 
extension headers and options. 


5.3 IPv6 Extension Headers 

In IPv6, special functions such as those provided by options in IPv4 can be enabled 
by adding extension headers that follow the IPv6 header. The routing and 
timestamp functions from IPv4 are supported this way, as well as some other functions 
such as fragmentation and extra-large packets that were deemed to be rarely used 
for most IPv6 traffic (but still desired) and thereby did not justify allocating bits 
in the IPv6 header to support them. With this arrangement, the IPv6 header is 
fixed at 40 bytes, and extension headers are added only when needed. In choosing 
the IPv6 header to be of a fixed size, and requiring that extension headers be 
processed only by end hosts (with one exception), the designers of IPv6 have made the 
design and construction of high-performance routers easier because the demands 
on packet processing at routers can be simpler than with IPv4. In practice, 
packet-processing performance is governed by many factors, including the complexity 
of the protocol, the capabilities of the hardware and software in the router, and 
traffic load. 

Extension headers, along with headers of higher-layer protocols such as TCP 
or UDP, are chained together with the IPv6 header to form a cascade of headers 
(see Figure 5-6). The Next Header field in each header indicates the type of the 
subsequent header, which could be an IPv6 extension header or some other type. 
The value of 59 indicates the end of the header chain. The possible values for the 
Next Header field are available at [IP6PARAM], and most are provided in Table 5-5. 

[Figure 5-6: three example IPv6 datagrams] 

IPv6 Header (40 bytes), Next Header = TCP | TCP Header (20 bytes) | TCP Data 

IPv6 Header, Next Header = Routing | Routing Header, Next Header = TCP | TCP Header | TCP Data 

IPv6 Header, Next Header = Routing | Routing Header, Next Header = Fragment | Fragment Header, Next Header = TCP | TCP Header | TCP Data 

Figure 5-6 IPv6 headers form a chain using the Next Header field. Headers in the chain 
may be IPv6 extension headers or transport headers. The IPv6 header appears 
at the beginning of the datagram and is always 40 bytes long. 


Table 5-5 The values for the IPv6 Next Header field may indicate extensions or headers for other protocols. The 
same values are used with the IPv4 Protocol field, where appropriate. 

Header Type                             Order  Value  References 
IPv6 header                             1      41     [RFC2460][RFC2473] 
Hop-by-Hop Options (HOPOPT)             2      0      [RFC2460]; must immediately follow IPv6 header 
Destination Options                     3,8    60     [RFC2460] 
Routing                                 4      43     [RFC2460][RFC5095] 
Fragment                                5      44     [RFC2460] 
Encapsulating Security Payload (ESP)    7      50     (See Chapter 18) 
Authentication (AH)                     6      51     (See Chapter 18) 
Mobility (MIPv6)                        9      135    [RFC6275] 
(None: no next header)                  Last   59     [RFC2460] 
ICMPv6                                  Last   58     (See Chapter 8) 
UDP                                     Last   17     (See Chapter 10) 
TCP                                     Last   6      (See Chapters 13-17) 
Various other upper-layer protocols     Last   -      See [AN] for complete list 

As we can see from Table 5-5, the IPv6 extension header mechanism 
distinguishes some functions (e.g., routing and fragmentation) from options. The order 
of the extension headers is given as a recommendation, except for the location of 
the Hop-by-Hop Options, which is mandatory, so an IPv6 implementation must 
be prepared to process extension headers in the order in which they are received. 
Only the Destination Options header can be used twice: the first time for options 
pertaining to the destination IPv6 address contained in the IPv6 header and the 
second time (position 8) for options pertaining to the final destination of the 
datagram. In some cases (e.g., when the Routing header is used), the Destination IP 
Address field in the IPv6 header changes as the datagram is forwarded to its 
ultimate destination. 
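
Following the chain of Next Header values can be sketched in Python as below. This is a simplified illustration under several assumptions that are not spelled out in the text above: the Next Header field is taken to be byte 6 of the 40-byte IPv6 header, the Hop-by-Hop Options, Routing, and Destination Options headers are taken to start with a Next Header byte followed by a length byte counting 8-byte units beyond the first 8 bytes (as specified in [RFC2460]), and the Fragment header is taken to be a fixed 8 bytes. The walk stops at any other value (ESP, AH, and Mobility headers are not parsed), and no bounds checking is performed. The function name header_chain is hypothetical.

```python
HOP_BY_HOP, ROUTING, FRAGMENT, DEST_OPTIONS = 0, 43, 44, 60
VARIABLE_LENGTH = {HOP_BY_HOP, ROUTING, DEST_OPTIONS}
IPV6_HEADER_LEN = 40


def header_chain(datagram: bytes) -> list:
    """Return the sequence of Next Header values found in an IPv6 datagram."""
    chain = []
    next_header = datagram[6]          # Next Header field of the IPv6 header
    offset = IPV6_HEADER_LEN
    while True:
        chain.append(next_header)
        if next_header in VARIABLE_LENGTH:
            size = (datagram[offset + 1] + 1) * 8   # length in 8-byte units
        elif next_header == FRAGMENT:
            size = 8
        else:
            return chain               # upper-layer header, 59, or unparsed
        next_header = datagram[offset]  # Next Header of this extension header
        offset += size


# Third datagram of Figure 5-6: IPv6 -> Routing -> Fragment -> TCP.
ipv6 = bytearray(IPV6_HEADER_LEN)
ipv6[0] = 0x60                          # version 6
ipv6[6] = ROUTING
routing = bytes([FRAGMENT, 0]) + bytes(6)
fragment = bytes([6, 0]) + bytes(6)     # 6 = TCP
packet = bytes(ipv6) + routing + fragment + bytes(20)

print(header_chain(packet))             # [43, 44, 6]
```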

5.3.1 IPv6 Options 

As we have seen, IPv6 provides a more flexible and extensible way of 
incorporating extensions and options as compared to IPv4. Those options from IPv4 that 
ceased to be useful because of space limitations in the IPv4 header appear in IPv6 
as variable-length extension headers or options encoded in special extension 
headers that can accommodate today's much larger Internet. Options, if present, 
are grouped into either Hop-by-Hop Options (those relevant to every router along a 
datagram's path) or Destination Options (those relevant only to the recipient). 
Hop-by-Hop Options (called HOPOPTs) are the only ones that need to be processed 
by every router a packet encounters. The format for encoding options within the 
Hop-by-Hop and Destination Options extension headers is common. 

The Hop-by-Hop and Destination Options headers are capable of holding 
more than one option. Each of these options is encoded as type-length-value (TLV) 
sets, according to the format shown in Figure 5-7. 


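[Figure 5-7: layout of an IPv6 TLV option, bits 0-15 followed by the option data] 

Action (2 bits) | Chg (1 bit) | Type Subfield (5 bits) | Opt Data Len (8 bits) | Option Data (variable) 

The Action, Chg, and Type subfields together form the 8-bit Option Type. 
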


Figure 5-7 Hop-by-hop and Destination Options are encoded as TLV sets. The first byte gives 
the option type, including subfields indicating how an IPv6 node should behave if the 
option is not recognized, and whether the option data might change as the datagram is 
forwarded. The Opt Data Len field gives the size of the option data in bytes. 


The TLV structure shown in Figure 5-7 includes 2 bytes followed by a variable- 
length number of data bytes. The first byte indicates the type of the option and 
includes three subfields. The first subfield gives the action to be taken by an IPv6 
node attempting to process the option that does not recognize the 5-bit option Type 
subfield. Its possible values are presented in Table 5-6. 

Table 5-6 The 2 high-order bits in an IPv6 TLV option type indicate whether an IPv6 node should 
forward or drop the datagram if the option is not recognized, and whether a message 
indicating the datagram's fate should be sent back to the sender. 

Value   Action 
00      Skip option, continue processing 
01      Discard the datagram (silently) 
10      Discard the datagram and send an ICMPv6 Parameter Problem message to 
        the source address 
11      Same as 10, but send the ICMPv6 message only if the offending packet's 
        destination was not multicast 


If an unknown option were included in a datagram destined for a multicast 
destination, a large number of nodes could conceivably generate traffic back to the 
source. This can be avoided by use of the bit pattern 11 for the Action subfield. The 
flexibility of the Action subfield is useful in the development of new options. A 
newly specified option can be carried in datagrams and simply ignored by those 
routers that do not understand it, helping to promote incremental deployment of 
new options. The Change bit field (Chg in Figure 5-7) is set to 1 when the option data 
may be modified as the datagram is forwarded. The options shown in Table 5-7 
have been defined for IPv6. 


Table 5-7 Options in IPv6 are carried in either Hop-by-Hop (H) or Destination (D) Options 
extension headers. The option Type field contains the value from the "Type" column with the 
Action and Change subfields denoted in binary. The "Length" column contains the value 
of the Opt Data Len byte from Figure 5-7. The Pad1 option is the only one lacking this byte. 

Option Name                   Header  Action  Change  Type  Length  References 
Pad1                          HD      00      0       0     N/A     [RFC2460] 
PadN                          HD      00      0       1     var     [RFC2460] 
Jumbo Payload                 H       11      0       194   4       [RFC2675] 
Tunnel Encapsulation Limit    D       00      0       4     4       [RFC2473] 
Router Alert                  H       00      0       5     4       [RFC2711] 
Quick-Start                   H       00      1       6     8       [RFC4782] 
CALIPSO                       H       00      0       7     8+      [RFC5570] 
Home Address                  D       11      0       201   16      [RFC6275] 
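
The subfields of the IPv6 option Type byte can be separated with simple shifts and masks. The following Python sketch uses the Action meanings from Table 5-6 and the 8-bit values 194 (Jumbo Payload) and 201 (Home Address) from Table 5-7; the function names and the wording of the ACTIONS strings are hypothetical.

```python
ACTIONS = {
    0b00: "skip option, continue processing",
    0b01: "discard silently",
    0b10: "discard, send ICMPv6 Parameter Problem",
    0b11: "discard, send ICMPv6 Parameter Problem unless multicast",
}


def decode_ipv6_option_type(value: int) -> tuple:
    """Split an IPv6 option Type byte into (action, change, type subfield)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("option Type is a single byte")
    return value >> 6, (value >> 5) & 1, value & 0b11111


def encode_ipv6_option_type(action: int, change: int, subtype: int) -> int:
    """Pack the three subfields back into a single Type byte."""
    return ((action & 0b11) << 6) | ((change & 1) << 5) | (subtype & 0b11111)


action, change, subtype = decode_ipv6_option_type(194)   # Jumbo Payload
print(action, change, subtype)     # 3 0 2
print(ACTIONS[action])             # discard, send ICMPv6 ... unless multicast
print(decode_ipv6_option_type(201))            # (3, 0, 9)
print(encode_ipv6_option_type(0b00, 1, 6))     # 38
```

The last line packs the Action, Change, and Type values listed for Quick-Start in Table 5-7 into one byte.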


5.3.1.1 Pad1 and PadN 

IPv6 options are aligned to 8-byte offsets, so options that are naturally smaller are 
padded with 0 bytes to round out their lengths to the nearest 8 bytes. Two padding 
options are available to support this, called Pad1 and PadN. The Pad1 option (type 0) 
is the only option that lacks Length and Value fields. It is simply 1 byte long and 
contains the value 0. The PadN option (type 1) inserts 2 or more bytes of padding 
into the options area of the header using the format of Figure 5-7. For n bytes of 
padding, the Opt Data Len field contains the value (n-2). 
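
The choice between the two padding options can be sketched in Python as follows. The function names padding and pad_to_multiple_of_8 are hypothetical, and the second helper simply pads whatever byte string it is given; it assumes the caller passes the complete area that must end on an 8-byte boundary.

```python
def padding(n: int) -> bytes:
    """Return n bytes of IPv6 option padding using Pad1 or PadN."""
    if n < 0 or n > 257:               # Opt Data Len (n - 2) must fit in 1 byte
        raise ValueError("padding length out of range")
    if n == 0:
        return b""
    if n == 1:
        return b"\x00"                 # Pad1: a single byte with value 0
    # PadN: type 1, Opt Data Len = n - 2, then n - 2 zero bytes
    return bytes([1, n - 2]) + bytes(n - 2)


def pad_to_multiple_of_8(area: bytes) -> bytes:
    """Append padding so that the total length is a multiple of 8 bytes."""
    return area + padding(-len(area) % 8)


print(padding(1).hex())                        # 00
print(padding(2).hex())                        # 0100
print(padding(6).hex())                        # 010400000000
print(len(pad_to_multiple_of_8(bytes(10))))    # 16
```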

5.3.1.2 IPv6 Jumbo Payload 

In some TCP/IP networks, such as those used to interconnect supercomputers, 
the normal 64KB limit on the IP datagram size can lead to unwanted overhead 
when moving large amounts of data. The IPv6 Jumbo Payload option specifies an 
IPv6 datagram with payload size larger than 65,535 bytes, called a jumbogram. This 
option need not be implemented by nodes attached to links with MTU sizes below 
64KB. The Jumbo Payload option provides a 32-bit field for holding the payload 
size for datagrams with payloads of sizes between 65,535 and 4,294,967,295 bytes. 

When a jumbogram is formed for transmission, its normal Payload Length field 
is set to 0. As we shall see later, the TCP protocol makes use of the Payload Length 
field in order to compute its checksum using the Internet checksum algorithm 
described previously. When the Jumbo Payload option is used, TCP must be 
careful to use the length value from the option instead of the regular Length field in 
the base header. Although this procedure is not difficult, larger payloads can lead 
to an increased chance of undetected error [RFC2675]. 
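
As a rough illustration, the option and the length selection described above might look as follows in Python. The option Type value 194 and Opt Data Len of 4 are taken from Table 5-7; the function names are hypothetical, and the sketch ignores the alignment of the option within the Hop-by-Hop Options header.

```python
import struct

JUMBO_PAYLOAD_TYPE = 194      # from Table 5-7
JUMBO_OPT_DATA_LEN = 4        # 32-bit length field


def jumbo_payload_option(length: int) -> bytes:
    """Encode a Jumbo Payload option as type, length, and 32-bit value."""
    if not 65535 < length <= 0xFFFFFFFF:
        raise ValueError("jumbogram payloads exceed 65,535 bytes")
    return struct.pack("!BBI", JUMBO_PAYLOAD_TYPE, JUMBO_OPT_DATA_LEN, length)


def effective_payload_length(payload_length_field: int, jumbo_length=None) -> int:
    """Length an upper layer such as TCP should use for its checksum."""
    if payload_length_field == 0 and jumbo_length is not None:
        return jumbo_length            # jumbogram: use the option's value
    return payload_length_field        # ordinary datagram


print(jumbo_payload_option(100000).hex())       # c204000186a0
print(effective_payload_length(0, 100000))      # 100000
print(effective_payload_length(1280))           # 1280
```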

5.3.1.3 Tunnel Encapsulation Limit 

Tunneling refers to the encapsulation of one protocol in another that does not 
conform to traditional layering (see Chapters 1 and 3). For example, IP datagrams may 
be encapsulated inside the payload portion of another IP datagram. Tunneling can 
be used to form virtual overlay networks, in which one network (e.g., the Internet) 
acts as a well-connected link layer for another layer of IP [TWEF03]. Tunnels can 
be nested in the sense that datagrams that are in a tunnel may themselves be 
placed in a tunnel, in a recursive fashion. 

When sending an IP dafagram, a sender does nof ordinarily have much con- 
frol over how many funnel levels are ulfimafely used for encapsulafion. Using fhis 
opfion, however, a sender can specify fhis limif. A router intending fo encapsulafe 
an IPv6 dafagram info a funnel firsf checks for fhe presence and value of fhe Tun¬ 
nel Encapsulafion Limif opfion. If fhe limif value is 0, fhe dafagram is discarded 
and an ICMPv6 Paramefer Problem message (see Chapfer 8) is senf fo fhe source 
of fhe dafagram (i.e., fhe previous funnel enfry poinf). If fhe limif is nonzero, fhe 
funnel encapsulafion is permiffed, buf fhe newly formed (encapsulafing) IPv6 
dafagram musf include a Tunnel Encapsulafion Limif opfion whose value is 1 less 
fhan fhe opfion value in fhe arriving dafagram. In effecf, fhe encapsulafion limif 
acfs like fhe IPv4 TTL or IPv6 Hop Limit field, buf for levels of funnel encapsulafion 
insfead of forwarding hops. 
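
The check a tunnel entry point performs can be sketched in a few lines of Python. The function and exception names are hypothetical, and treating a missing option as "no limit" (returning `None`) is an assumption of this sketch rather than something the text specifies.

```python
from typing import Optional


class EncapsulationLimitExceeded(Exception):
    """Raised where a real router would discard the datagram and send an
    ICMPv6 Parameter Problem message back to the previous tunnel entry point."""


def next_tunnel_limit(arriving_limit: Optional[int]) -> Optional[int]:
    """Return the limit to place in the new (encapsulating) datagram.

    arriving_limit is the Tunnel Encapsulation Limit found in the arriving
    datagram, or None if the option is absent (assumed to mean "no limit").
    """
    if arriving_limit is None:
        return None
    if arriving_limit == 0:
        raise EncapsulationLimitExceeded("no further tunnel nesting permitted")
    # Encapsulation is allowed; the outer datagram carries a value one less.
    return arriving_limit - 1
```

A datagram arriving with a limit of 2 could therefore be tunneled twice more: the outer datagrams would carry limits of 1 and then 0, and a third attempt would be rejected.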

5.3.1.4 Router Alert 

The Router Alert option indicates that the datagram contains information that 
needs to be processed by a router. It is used for the same purpose as the IPv4 
Router Alert option. [RTAOPTS] gives the current set of values for the option. 

5.3.1.5 Quick-Start 

The Quick-Start (QS) option is used in conjunction with the experimental 
Quick-Start procedure for TCP/IP specified in [RFC4782]. It is applicable to both IPv4 and 
IPv6 but at present is suggested only for private networks and not the global Internet. 
The option includes a value encoding the sender's desired transmission rate in 
bits per second, a QS TTL value, and some additional information. Routers along 
the path may agree that supporting the desired rate is acceptable, in which case 
they decrement the QS TTL and leave the rate request unchanged when forwarding 
the containing datagram. When they disagree (i.e., wish to support a lower 
rate), they can reduce the number to an acceptable rate. Routers that do not recognize 
the QS option do not decrement the QS TTL. A receiver provides feedback to 
the sender, including the difference between the received datagram's IPv4 TTL or 
IPv6 Hop Limit field and its QS TTL, along with the resulting rate that may have 
been adjusted by the routers along the forward path. This information is used by 
the sender to determine its sending rate (which, for example, may exceed the rate 
TCP would otherwise use). Comparison of the TTL values is used to ensure that 
every router along the path participates in the QS negotiation; if any routers are 
found to be decrementing the IPv4 TTL (or IPv6 Hop Limit) field and not modifying 
the QS TTL value, QS is not enabled. 
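
The TTL comparison can be made concrete with a small simulation. This is only a sketch under stated assumptions: the function names are hypothetical, rates are plain integers rather than the encoded rate field of the real option, and the difference is taken modulo 256 as in [RFC4782].

```python
from typing import List, Optional, Tuple


def ttl_diff(ip_ttl: int, qs_ttl: int) -> int:
    """Difference between the IP TTL (or Hop Limit) and the QS TTL, modulo 256."""
    return (ip_ttl - qs_ttl) % 256


def forward(ip_ttl: int, qs_ttl: int, rate: int,
            supports_qs: bool, max_rate: Optional[int]) -> Tuple[int, int, int]:
    """Model one router hop. Every router decrements the IP TTL; only routers
    that take part in Quick-Start also decrement the QS TTL and may lower the rate."""
    ip_ttl -= 1
    if supports_qs:
        qs_ttl = (qs_ttl - 1) % 256
        if max_rate is not None:
            rate = min(rate, max_rate)
    return ip_ttl, qs_ttl, rate


def negotiate(path: List[Tuple[bool, Optional[int]]], requested_rate: int,
              ip_ttl: int = 64, qs_ttl: int = 200) -> Optional[int]:
    """Return the approved rate, or None if some router did not participate.

    path is a list of (supports_qs, max_rate) pairs, one per router.
    """
    sent_diff = ttl_diff(ip_ttl, qs_ttl)
    rate = requested_rate
    for supports_qs, max_rate in path:
        ip_ttl, qs_ttl, rate = forward(ip_ttl, qs_ttl, rate, supports_qs, max_rate)
    # The receiver reports the difference it observed; the sender compares it
    # with the difference it originally sent.
    reported_diff = ttl_diff(ip_ttl, qs_ttl)
    return rate if reported_diff == sent_diff else None
```

If every router decrements both values, the difference is unchanged and the (possibly reduced) rate is approved; a single router that decrements only the IP TTL changes the difference, and the request fails.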

5.3.1.6 CALIPSO 

This option is used for supporting the Common Architecture Label IPv6 Security 
Option (CALIPSO) [RFC5570] in certain private networks. It provides a method to 
label datagrams with a security-level indicator, along with some additional information. 
In particular, it is intended for use in multilevel secure networking environments 
(e.g., government, military, and banking) where the security level of all 
data must be indicated by some form of label. 

5.3.1.7 Home Address 

This option holds the "home" address of the IPv6 node sending the datagram 
when IPv6 mobility options are in use. Mobile IP (see Section 5.5) specifies a set of 
procedures for handling IP nodes that may change their point of network attachment 
without losing their higher-layer network connections. It has a concept of 
a node's "home," which is derived from the address prefix of its typical location. 
When roaming away from home, the node is generally assigned a different IP 
address. This option allows the node to provide its normal home address in addition 
to its (presumably temporarily assigned) new address while traveling. The 
home address can be used by other IPv6 nodes when communicating with the 
mobile node. If the Home Address option is present, the Destination Options 
header containing it must appear after a Routing header and before the Fragment, 
Authentication, and ESP headers (see Chapter 18), if any of them is also present. 
We discuss this option in more detail in the context of Mobile IP. 

5.3.2 Routing Header 

The IPv6 Routing header provides a mechanism for the sender of an IPv6 datagram 
to control, at least in part, the path the datagram takes through the network. 
At present, two different versions of the routing extension header have been specified, 
called type 0 (RH0) and type 2 (RH2), respectively. RH0 has been deprecated 
because of security concerns [RFC5095], and RH2 is defined in conjunction with 
Mobile IP. To best understand the Routing header, we begin by discussing RH0 
and then investigate why it has been deprecated and how it differs from RH2. RH0 
specifies one or more IPv6 nodes to be "visited" as the datagram is forwarded. The 
header is shown in Figure 5-8. 


0                                  15 16                                 31 
Next Header (8 bits) | Header Extension Length (8 bits) | Routing Type (0) (8 bits) | Segments Left (8 bits) 
Reserved (0) (32 bits) 
IP Address [1] (128 bits) 
... 
IP Address [n] (128 bits) 
(Routing Extension Header, variable length) 


Figure 5-8 The now-deprecated Routing header type 0 (RH0) generalizes the IPv4 loose and strict 
Source Route and Record Route options. It is constructed by the sender to include IPv6 
node addresses that act as waypoints when the datagram is forwarded. Each address can 
be specified as a loose or strict address. A strict address must be reached by a single IPv6 
hop, whereas a loose address may contain one or more other hops in between. The IPv6 
Destination IP Address field in the base header is modified to contain the next waypoint 
address as the datagram is forwarded. 


The IPv6 Routing header shown in Figure 5-8 generalizes the loose Source 
and Record Route options from IPv4. It also supports the possibility of routing on 
identifiers other than IPv6 addresses, although this feature is not standardized 
and is not discussed further here. For standardized routing on IPv6 addresses, 
RH0 allows the sender to specify a vector of IPv6 addresses for nodes to be visited. 

The header contains an 8-bit Routing Type identifier and an 8-bit Segments 
Left field. The type identifier for IPv6 addresses is 0 for RH0 and 2 for RH2. The 
Segments Left field indicates how many route segments remain to be processed, 
that is, the number of explicitly listed intermediate nodes still to be visited before 
reaching the final destination. The block of addresses starts with a 32-bit reserved 
field set by the sender to 0 and ignored by receivers. The addresses are nonmulticast 
IPv6 addresses to be visited as the datagram is forwarded. 

A Routing header is not processed until it reaches the node whose address is 
contained in the Destination IP Address field of the IPv6 header. At this time, the 
Segments Left field is used to determine the next hop address from the address vector, 
and this address is swapped with the Destination IP Address field in the IPv6 
header. Thus, as the datagram is forwarded, the Segments Left field grows smaller, 
and the list of addresses in the header reflects the node addresses that forwarded 
the datagram. The forwarding procedure is better understood with an example 
(see Figure 5-9). 



Figure 5-9 Using an IPv6 Routing header (RH0), the sender (S) is able to direct the datagram 
through the intermediate nodes R^ and R^. The other nodes traversed are determined by 
the normal IPv6 routing. Note that the destination address in the IPv6 header is updated 
at each hop specified in the Routing header. 


In Figure 5-9 we can see how the Routing header is processed by intermedi¬ 
ate nodes. The sender (S) constructs the datagram with destination address R^ 
and a Routing header (type 0) containing the addresses R^, Rj, and D. The final 
destination of the datagram is the last address in the list (D). The Segments Left 
field (labeled "Left" in Figure 5-9) starts at 3. The datagram is forwarded toward 
Rj automatically by S and R^. Because R^'s address is not present in the datagram, 
no modifications of the Routing header or addresses are performed by R^. Upon 
reaching Rj, the destination address from the base header is swapped with the first 
address listed in the Routing header and the Segments Left field is decremented. 

As the datagram is forwarded, the process of swapping the destination 
address with the next address from the address list in the Routing header repeats 
until the last destination listed in the Routing header is reached. 
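
The swapping procedure can be sketched as follows. This is an illustrative model only: the `Datagram` class, the function name, and the node labels used in the usage note (W1, W2, D, X) are hypothetical and are not the labels of Figure 5-9. The index computation follows the RH0 processing rule in [RFC2460], adjusted for zero-based lists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Datagram:
    destination: str        # Destination IP Address field of the base header
    segments_left: int      # Segments Left field of the Routing header
    addresses: List[str]    # address vector carried in the Routing header


def process_routing_header(dgram: Datagram, my_address: str) -> bool:
    """Process an RH0 at one node. Returns True when the datagram has reached
    its final destination at this node, False if it must be forwarded on."""
    if dgram.destination != my_address:
        # Not addressed to this node: forward without touching the header.
        return False
    if dgram.segments_left == 0:
        # No waypoints remain: this node is the final destination.
        return True
    n = len(dgram.addresses)
    dgram.segments_left -= 1
    i = n - dgram.segments_left - 1   # zero-based slot of the next waypoint
    # Swap the base header destination with the next address in the vector.
    dgram.destination, dgram.addresses[i] = dgram.addresses[i], dgram.destination
    return False
```

Starting from `Datagram("W1", 2, ["W2", "D"])`, a node X that is not listed leaves the datagram unchanged; processing at W1 yields destination W2 with one segment left, and processing at W2 yields destination D with none left, at which point the address vector holds the waypoints already visited.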

We can arrange to include a Routing header with a simple command-line 
option to the ping6 command in Windows XP (Windows Vista and later include 
only the ping command, which incorporates IPv6 support): 


C:\> ping6 -r -s 2001:db8::100 2001:db8::1 

This command arranges to use the source address 2001:db8::100 when sending a 
ping request to 2001:db8::1. The -r option arranges for a Routing header (RH0) 
to be included. We can see the outgoing request using Wireshark (see Figure 5-10). 

Frame 1: 118 bytes on wire (944 bits), 118 bytes captured (944 bits) 
Ethernet II, Src: 00:05:4e:4a:24:bb, Dst: 00:04:5a:9f:9e:80 
Internet Protocol Version 6, Src: 2001:db8::100, Dst: 2001:db8::1 
    0110 .... = Version: 6 
    Traffic class: 0x00000000 
    Flowlabel: 0x00000000 
    Payload length: 64 
    Next header: IPv6 routing (0x2b) 
    Hop limit: 128 
    Source: 2001:db8::100 
    Destination: 2001:db8::1 
    Routing Header, Type: IPv6 Source Routing (0) 
        Next header: ICMPv6 (0x3a) 
        Length: 2 (24 bytes) 
        Type: IPv6 Source Routing (0) 
        Left segments: 1 
        Address: 2001:db8::100 
Internet Control Message Protocol v6 

Figure 5-10 The ping request appears as an ICMPv6 Echo Request in Wireshark. The IPv6 header 
includes a Next Header field indicating that the packet contains a type 0 Routing header, 
followed by an ICMPv6 header. The number of segments in the RH0 left to be processed 
is one (2001:db8::100). 


The ping message appears as an ICMPv6 Echo Request packet (see Chapter 
8). By following the Next Header field values, we can see that the base header is 
followed by a Routing header. In the Routing header, we can see that the type is 
0 (indicating an RH0), and there is one segment (hop) left to process. The hop is 
specified by the first slot in the address list (number 0): 2001:db8::100. 

As mentioned previously, RH0 has been deprecated by [RFC5095] because of 
a security concern that allows RH0 to be used to increase the effectiveness of DoS 
attacks. The problem is that RH0 allows the same address to be specified in multiple 
locations within the Routing header. This can lead to traffic being forwarded 
many times between two or more hosts or routers along a particular path. The 
potentially high traffic loads that can be created along particular paths in the network 
can cause disruption to other traffic flows competing for bandwidth across 
the same path. Consequently, RH0 has been deprecated and RH2 remains as 
the sole Routing header supported by IPv6. RH2 is equivalent to RH0 except it has 
room for only a single address and uses a different value in the Routing Type field. 

5.3.3 Fragment Header 

The Fragment header is used by an IPv6 source when sending a datagram larger 
than the path MTU of the datagram's intended destination. Path MTU and how 
it is determined are discussed in more detail in Chapter 13, but 1280 bytes is a 
network-wide link-layer minimum MTU for IPv6 (see section 5 of [RFC2460]). In 
IPv4, any host or router can fragment a datagram if it is too large for the MTU on 
the next hop, and fields within the second 32-bit word of the IPv4 header indicate 
the fragmentation information. In IPv6, only the sender of the datagram is permitted 
to perform fragmentation, and in such cases a Fragment header is added. 

The Fragment header includes the same information as is found in the IPv4 
header, but the Identification field is 32 bits instead of the 16 that are used for IPv4. 
The larger field provides the ability for more fragmented packets to be outstanding 
in the network simultaneously. The Fragment header uses the format shown 
in Figure 5-11. 

0                                  15 16                                 31 
Next Header (8 bits) | Reserved (0) (8 bits) | Fragment Offset (13 bits) | Res (2 bits) | M (1 bit) 
Identification (32 bits) 
(Fragment Extension Header, 8 bytes) 


Figure 5-11 The IPv6 Fragment header contains a 32-bit Identification field (twice as large as the 
Identification field in IPv4). The M bit field indicates whether the fragment is the last of an 
original datagram. As with IPv4, the Fragment Offset field gives the offset of the payload 
into the original datagram in 8-byte units. 

Referring to Figure 5-11, the Reserved field and 2-bit Res field are both zero 
and ignored by receivers. The Fragment Offset field indicates where the data that 
follows the Fragment header is located, as a positive offset in 8-byte units, relative 
to the "fragmentable part" (see the next paragraph) of the original IPv6 datagram. 
The M bit field, if set to 1, indicates that more fragments of the datagram 
follow. A value of 0 indicates that the fragment contains the last bytes of the 
original datagram. 

The datagram serving as input to the fragmentation process is called the 
"original packet" and consists of two parts: the "unfragmentable part" and the 
"fragmentable part." The unfragmentable part includes the IPv6 header and any 
included extension headers required to be processed by intermediate nodes to the 
destination (i.e., all headers up to and including the Routing header, otherwise 
the Hop-by-Hop Options extension header if only it is present). The fragmentable 
part constitutes the remainder of the datagram (i.e., Destination Options header, 
upper-layer headers, and payload data). 

When the original packet is fragmented, multiple fragment packets are produced, 
each of which contains a copy of the unfragmentable part of the original 
packet, but for which each IPv6 header has the Payload Length field altered to 
reflect the size of the fragment packet it describes. Following the unfragmentable 
part, each new fragment packet contains a Fragment header with an appropriately 
assigned Fragment Offset field (e.g., the first fragment contains offset 0) and a copy 
of the original packet's Identification field. The last fragment has its M (More 
Fragments) bit field set to 0. 

The following example illustrates the way an IPv6 source might fragment a 
datagram. In the example shown in Figure 5-12, a payload of 3960 bytes is fragmented 
such that no fragment's total packet size exceeds 1500 bytes (a typical MTU 
for Ethernet), yet the fragment data sizes still are arranged to be multiples of 8 bytes. 
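
The splitting rule can be sketched in Python. The function names are hypothetical, and the sketch assumes the unfragmentable part is just the 40-byte IPv6 header (no Hop-by-Hop or Routing headers).

```python
import struct
from typing import List, Tuple

IPV6_HEADER_LEN = 40   # base IPv6 header
FRAG_HEADER_LEN = 8    # Fragment extension header


def fragment_plan(payload_len: int, mtu: int,
                  unfrag_len: int = IPV6_HEADER_LEN) -> List[Tuple[int, int, int]]:
    """Return (offset in 8-byte units, data bytes, M bit) for each fragment."""
    # Largest multiple of 8 that fits after the unfragmentable part and
    # the Fragment header.
    max_data = (mtu - unfrag_len - FRAG_HEADER_LEN) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any fragment data")
    plan = []
    offset = 0
    while offset < payload_len:
        data_len = min(max_data, payload_len - offset)
        more = 1 if offset + data_len < payload_len else 0
        plan.append((offset // 8, data_len, more))
        offset += data_len
    return plan


def fragment_header(next_header: int, offset_units: int, more: int, ident: int) -> bytes:
    """Pack the 8-byte Fragment header: Next Header, Reserved (0),
    13-bit Fragment Offset + 2-bit Res (0) + M bit, 32-bit Identification."""
    return struct.pack("!BBHI", next_header, 0, (offset_units << 3) | more, ident)
```

With a 3960-byte payload and a 1500-byte MTU, `fragment_plan(3960, 1500)` gives fragments of 1448, 1448, and 1064 data bytes at offsets 0, 181, and 362 (in 8-byte units), with the M bit clear only on the last one.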


Original Packet:          IPv6 Header (40 bytes) | IP Data (3960 bytes) 
                          Length = 4000, Payload Length = 3960 

First Fragment:           IPv6 Header (40 bytes) | Frag Header (8 bytes) | IP Data (1448 bytes) 
                          Offset = 0, MF = 1, Identification = 2439898 
                          Length = 1496, Payload Length = 1456, Data Length = 1448 

Second Fragment:          IPv6 Header (40 bytes) | Frag Header (8 bytes) | IP Data (1448 bytes) 
                          Offset = 181, MF = 1, Identification = 2439898 
                          Length = 1496, Payload Length = 1456, Data Length = 1448 

Third (final) Fragment:   IPv6 Header (40 bytes) | Frag Header (8 bytes) | IP Data (1064 bytes) 
                          Offset = 362, MF = 0, Identification = 2439898 
                          Length = 1112, Payload Length = 1072, Data Length = 1064 


Figure 5-12 An example of IPv6 fragmentation where a 3960-byte payload is split into three 
fragment packets, each carrying 1448 data bytes or less. Each fragment contains a Fragment header with 
the identical Identification field. All but the last fragment have the More Fragments field 
(M) set to 1. The offset is given in 8-byte units; the last fragment, for example, 
contains data beginning at offset (362 * 8) = 2896 bytes from the beginning of the original 
packet's data. The scheme is similar to fragmentation in IPv4. 

In Figure 5-12 we see how the larger original packet has been fragmented 
into three smaller packets, each containing a Fragment header. The IPv6 header's 
Payload Length field is modified to reflect the size of the data and newly formed 
Fragment header. The Fragment header in each fragment contains a common 
Identification field, and the sender ensures that no distinct original packets are assigned 
the same field value within the expected lifetime of a datagram on the network. 

The Offset field in the Fragment header is given in 8-byte units, so fragmentation 
is performed at 8-byte boundaries, which is why the first and second fragments 
contain 1448 data bytes instead of 1452. Thus, every fragment except possibly the last 
carries a multiple of 8 data bytes. The receiver must ensure that all fragments of an original 
datagram have been received before performing reassembly. The reassembly procedure 
aggregates the fragments, forming the original datagram. As with fragmentation in 
IPv4 (see Chapter 10), fragments may arrive out of order at the receiver but are 
reassembled in order to form a datagram that is given to other protocols for processing. 

We can see the construction of an IPv6 fragment using this command on 
Windows 7: 


C:\> ping -l 3952 ff01::2 

Figure 5-13 shows the Wireshark output of the activity on the network as it runs. 


No.  Time       Source                      Destination  Protocol  Info 
1    0.000000   fe80::8db7:c661:2f58:e6f7   ff01::2      IPv6      IPv6 fragment (nxt=ICMPv6 (0x3a) off=0 id=0x10) 
2    0.001225   fe80::8db7:c661:2f58:e6f7   ff01::2      IPv6      IPv6 fragment (nxt=ICMPv6 (0x3a) off=1448 id=0x10) 
3    0.002145   fe80::8db7:c661:2f58:e6f7   ff01::2      ICMPv6    Echo (ping) request id=0x0001, seq=43 
4    4.838179   fe80::8db7:c661:2f58:e6f7   ff01::2      IPv6      IPv6 fragment (nxt=ICMPv6 (0x3a) off=0 id=0x12) 
5    4.839406   fe80::8db7:c661:2f58:e6f7   ff01::2      IPv6      IPv6 fragment (nxt=ICMPv6 (0x3a) off=1448 id=0x12) 
6    4.840326   fe80::8db7:c661:2f58:e6f7   ff01::2      ICMPv6    Echo (ping) request id=0x0001, seq=44 
7    9.845752   fe80::8db7:c661:2f58:e6f7   ff01::2      IPv6      IPv6 fragment (nxt=ICMPv6 (0x3a) off=0 id=0x14) 
8    9.846973   fe80::8db7:c661:2f58:e6f7   ff01::2      IPv6      IPv6 fragment (nxt=ICMPv6 (0x3a) off=1448 id=0x14) 
9    9.847880   fe80::8db7:c661:2f58:e6f7   ff01::2      ICMPv6    Echo (ping) request id=0x0001, seq=45 
10   14.837626  fe80::8db7:c661:2f58:e6f7   ff01::2      IPv6      IPv6 fragment (nxt=ICMPv6 (0x3a) off=0 id=0x16) 
11   14.838850  fe80::8db7:c661:2f58:e6f7   ff01::2      IPv6      IPv6 fragment (nxt=ICMPv6 (0x3a) off=1448 id=0x16) 
12   14.841029  fe80::8db7:c661:2f58:e6f7   ff01::2      ICMPv6    Echo (ping) request id=0x0001, seq=46 

Ethernet II, Src: 20:6a:8a:31:7c:74, Dst: 33:33:00:00:00:02 
Internet Protocol Version 6, Src: fe80::8db7:c661:2f58:e6f7, Dst: ff01::2 
    0110 .... = Version: 6 
    Traffic class: 0x00000000 
    Flowlabel: 0x00000000 
    Payload length: 1456 
    Next header: IPv6 fragment (0x2c) 
    Hop limit: 128 
    Source: fe80::8db7:c661:2f58:e6f7 
    Destination: ff01::2 
    Fragmentation Header 
        Next header: ICMPv6 (0x3a) 
        0000 0000 0000 0... = Offset: 0 (0x0000) 
        .... .... .... ...1 = More Fragment: Yes 
        Identification: 0x00000010 
        Reassembled IPv6 in frame: 3 
    Data (1448 bytes) 
        Data: 800035620001002b6162636465666768696a6b6c6d6e6f70... 
        [Length: 1448] 

Figure 5-13 The ping program generates ICMPv6 packets (see Chapter 8) containing 3960 IPv6 
payload bytes in this example. These packets are fragmented to produce three packet 
fragments, each of which is small enough to fit in the Ethernet MTU size of 1500 bytes. 


In Figure 5-13 we see the fragments constituting four ICMPv6 Echo Request 
messages sent to the IPv6 multicast address ff01::2. Each request requires fragmentation 
because the -l 3952 option indicates that 3952 data bytes are to be carried 
in the data area of each ICMPv6 message (leading to an IPv6 payload length 
of 3960 bytes due to the 8-byte ICMPv6 header). The IPv6 source address is link-local. 
To determine the target's link-layer multicast address, a mapping procedure 
specific to IPv6 is performed, described in Chapter 9. The ICMPv6 Echo Request 
(generated by the ping program) spans several fragments, which Wireshark reassembles 
to display once it has processed all the constituent fragments. Figure 5-14 
shows the second fragment in more detail. 


(Packet list as in Figure 5-13, with frame 2 selected.) 

Frame 2: 1510 bytes on wire (12080 bits), 1510 bytes captured (12080 bits) 
Ethernet II, Src: 20:6a:8a:31:7c:74, Dst: 33:33:00:00:00:02 
Internet Protocol Version 6, Src: fe80::8db7:c661:2f58:e6f7, Dst: ff01::2 
    0110 .... = Version: 6 
    Traffic class: 0x00000000 
    Flowlabel: 0x00000000 
    Payload length: 1456 
    Next header: IPv6 fragment (0x2c) 
    Hop limit: 128 
    Source: fe80::8db7:c661:2f58:e6f7 
    Destination: ff01::2 
    Fragmentation Header 
        Next header: ICMPv6 (0x3a) 
        0000 0101 1010 1... = Offset: 181 (0x00b5) 
        .... .... .... ...1 = More Fragment: Yes 
        Identification: 0x00000010 
        Reassembled IPv6 in frame: 3 
    Data (1448 bytes) 
        Data: 6f70717273747576776162636465666768696a6b6c6d6e6f... 
        [Length: 1448] 

Figure 5-14 The second fragment of an ICMPv6 Echo Request has an IPv6 payload length of 1456 bytes, 
including the 8-byte Fragment header. The presence of the Fragment header indicates 
that the overall datagram was fragmented at the source, and the Offset field of 181 
indicates that this fragment contains data starting at byte offset 1448. The More Fragments 
bit field being set indicates that other fragments are needed to reassemble the datagram. 
All fragments from the same original datagram contain the same Identification field (0x10 
in this case). 


In Figure 5-14 we see the IPv6 header, with payload length 1456 bytes (1448 data 
bytes plus the 8-byte Fragment header), as expected. The Next Header field contains 
the value 44 (0x2c) we saw in Table 5-5, 
indicating that a Fragment header follows the IPv6 header. The Fragment header 
indicates that the following header is for ICMPv6, meaning there are no more 
extension headers. Also, the Offset field is 181, meaning this fragment contains 
data at byte offset 1448 in the original datagram. We know it is not the last fragment 
because the More Fragments field is set (displayed as Yes by Wireshark). 
Figure 5-15 shows the final fragment of the initial ICMPv6 Echo Request datagram. 


(Packet list as in Figure 5-13, with frame 3 selected.) 

Frame 3: 1126 bytes on wire (9008 bits), 1126 bytes captured (9008 bits) 
Ethernet II, Src: 20:6a:8a:31:7c:74, Dst: 33:33:00:00:00:02 
Internet Protocol Version 6, Src: fe80::8db7:c661:2f58:e6f7, Dst: ff01::2 
    0110 .... = Version: 6 
    Traffic class: 0x00000000 
    Flowlabel: 0x00000000 
    Payload length: 1072 
    Next header: IPv6 fragment (0x2c) 
    Hop limit: 128 
    Source: fe80::8db7:c661:2f58:e6f7 
    Destination: ff01::2 
    Fragmentation Header 
        Next header: ICMPv6 (0x3a) 
        0000 1011 0101 0... = Offset: 362 (0x016a) 
        .... .... .... ...0 = More Fragment: No 
        Identification: 0x00000010 
    [3 IPv6 Fragments (3960 bytes): #1(1448), #2(1448), #3(1064)] 
        [Frame: 1, payload: 0-1447 (1448 bytes)] 
        [Frame: 2, payload: 1448-2895 (1448 bytes)] 
        [Frame: 3, payload: 2896-3959 (1064 bytes)] 
        [Fragment count: 3] 
        [Reassembled IPv6 length: 3960] 
Internet Control Message Protocol v6 
    Type: Echo (ping) request (128) 
    Code: 0 
    Checksum: 0x3562 [correct] 
    Identifier: 0x0001 
    Sequence: 43 
    Data (3952 bytes) 

Figure 5-15 The last fragment of the first ICMPv6 Echo Request datagram has an offset of 362 * 8 
= 2896 and payload length of 1072 bytes (1064 bytes of the original datagram's payload 
plus 8 bytes of Fragment header). The More Fragments bit field being set to 0 indicates 
that this is the last fragment, and the original datagram's total payload length is 2896 
+ 1064 = 3960 bytes (3952 bytes of ICMP data plus 8 bytes for the ICMPv6 header; see 
Chapter 8). 


In Figure 5-15 we see that the Offset field has the value 362, but this is in 8-byte 
units, meaning that the byte offset relative to the original datagram is 362 * 8 = 
2896. The Payload Length field has the value 1072, which includes 8 bytes for the 
Fragment header. Wireshark computes the fragmentation pattern for us, indicating 
that the first and second fragments contained the first and second sets of 1448 
bytes, and the final fragment contained 1064. All in all, the fragmentation process 
added 40*2 + 8*3 = 104 bytes to be carried by the network layer (two additional 
IPv6 headers plus an 8-byte Fragment header for each fragment). If we add 
link-layer overhead, the total comes to 104 + (2*18) = 140 bytes. (Each new Ethernet 
frame includes a 14-byte header and a 4-byte CRC.) 
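
The overhead arithmetic generalizes to any number of fragments. The helper below is a hypothetical sketch that assumes a 40-byte IPv6 header with no other extension headers and 18 bytes of Ethernet framing per frame, as in the example above.

```python
from typing import Tuple


def fragmentation_overhead(num_fragments: int, ip_header: int = 40,
                           frag_header: int = 8,
                           link_overhead: int = 18) -> Tuple[int, int]:
    """Return (network-layer bytes added, total bytes added including link layer)
    when one datagram is sent as num_fragments fragments instead of one packet."""
    extra_packets = num_fragments - 1
    # Each extra packet needs its own IPv6 header; every fragment needs a
    # Fragment header.
    network = extra_packets * ip_header + num_fragments * frag_header
    # Each extra packet also needs its own Ethernet header and CRC.
    return network, network + extra_packets * link_overhead
```

For the three fragments in this example, `fragmentation_overhead(3)` reproduces the figures in the text: 104 bytes at the network layer and 140 bytes including Ethernet framing.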


5.4 IP Forwarding 

Conceptually, IP forwarding is simple, especially for a host. If the destination is 
directly connected to the host (e.g., a point-to-point link) or on a shared network 
(e.g., Ethernet), the IP datagram is sent directly to the destination; a router is not 
required or used. Otherwise, the host sends the datagram to a single router (called 
the default router) and lets the router deliver the datagram to its destination. This 
simple scheme handles most host configurations. 

In this section we investigate the details of this simple situation and also how 
IP forwarding works when the situation is not as simple. We begin by noting that 
most hosts today can be configured to be routers as well as hosts, and many home 
networks use an Internet-connected PC to act as a router (and also a firewall, as 
we discuss in Chapter 7). What differentiates a host from a router to IP is how IP 
datagrams are handled: a host never forwards datagrams it does not originate, 
whereas routers do. 

In our general scheme, the IP protocol can receive a datagram either from 
another protocol on the same machine (TCP, UDP, etc.) or from a network interface. 
The IP layer has some information in memory, usually called a routing table or 
forwarding table, which it searches each time it receives a datagram to send. When a 
datagram is received from a network interface, IP first checks if the destination IP 
address is one of its own IP addresses (i.e., one of the IP addresses associated with 
one of its network interfaces) or some other address for which it should receive 
traffic such as an IP broadcast or multicast address. If so, the datagram is delivered 
to the protocol module specified by the Protocol field in the IPv4 header or Next 
Header field in the IPv6 header. If the datagram is not destined for one of the IP 
addresses being used locally by the IP module, then (1) if the IP layer was configured 
to act as a router, the datagram is forwarded (that is, handled as an outgoing 
datagram as described in Section 5.4.2); or (2) the datagram is silently discarded. 
Under some circumstances (e.g., no route is known in case 1), an ICMP message 
may be sent back to the source indicating an error condition. 
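
The receive-side decision described in the last paragraph can be summarized in a short Python sketch. The names `Action` and `classify` are hypothetical, and the sketch models only the three outcomes (deliver, forward, discard); ICMP error generation and the actual route lookup are left out.

```python
import ipaddress
from enum import Enum
from typing import Iterable


class Action(Enum):
    DELIVER = "deliver to the upper-layer protocol module"
    FORWARD = "handle as an outgoing datagram"
    DISCARD = "silently discard"


def classify(dst: str, local_addresses: Iterable[str],
             accepted_groups: Iterable[str] = (),
             is_router: bool = False) -> Action:
    """Decide what the IP layer does with a datagram received on an interface.

    local_addresses are the node's own interface addresses; accepted_groups are
    other addresses it should receive (e.g., broadcast or multicast addresses).
    """
    addr = ipaddress.ip_address(dst)
    local = {ipaddress.ip_address(a) for a in local_addresses}
    groups = {ipaddress.ip_address(a) for a in accepted_groups}
    if addr in local or addr in groups:
        return Action.DELIVER
    if is_router:
        return Action.FORWARD
    return Action.DISCARD
```

A host with address 192.0.2.1 would deliver a datagram addressed to 192.0.2.1, discard one addressed to 198.51.100.7, and forward the latter only if it had been configured to act as a router (these are documentation addresses chosen for illustration).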

5.4.1 Forwarding Table 

The IP protocol standards do not dictate the precise data required to be in a 
forwarding table, as this choice is left up to the implementor of the IP protocol. 
Nevertheless, several key pieces of information are generally required to implement 
the forwarding table for IP, and we shall discuss these now. Each entry in the 
routing or forwarding table contains the following information fields, at least 
conceptually: 

• Destination: This contains a 32-bit field (or 128-bit field for IPv6) used for 
matching the result of a masking operation (see the next bulleted item). 
The destination can be as simple as zero, for a "default route" covering all 
destinations, or as long as the full length of an IP address, in the case of a 
"host route" that describes only a single destination. 

• Mask: This contains a 32-bit field (128-bit field for IPv6) applied as a bitwise 
AND mask to the destination IP address of a datagram being looked up in 
the forwarding table. The masked result is compared with the set of 
destinations in the forwarding table entries. 

• Next-hop: This contains the 32-bit IPv4 address or 128-bit IPv6 address of 
the next IP entity (router or host) to which the datagram should be sent. The 
next-hop entity is typically on a network shared with the system performing 
the forwarding lookup, meaning the two share the same network prefix 
(see Chapter 2). 

• Interface: This contains an identifier used by the IP layer to reference the 
network interface that should be used to send the datagram to its next hop. 
For example, it could refer to a host's 802.11 wireless interface, a wired 
Ethernet interface, or a PPP interface associated with a serial port. If the 
forwarding system is also the sender of the IP datagram, this field is used in 
selecting which source IP address to use on the outgoing datagram (see 
Section 5.6.2.1). 

IP forwarding is performed on a hop-by-hop basis. As we can see from this 
forwarding table information, the routers and hosts do not contain the complete 
forwarding path to any destination (except, of course, those destinations that are 
directly connected to the host or router). IP forwarding provides the IP address of 
only the next-hop entity to which the datagram is sent. It is assumed that the next 
hop is really "closer" to the destination than the forwarding system is, and that 
the next-hop router is directly connected to (i.e., shares a common network prefix 
with) the forwarding system. It is also generally assumed that no "loops" are 
constructed between the next hops so that a datagram does not circulate around 
the network until its TTL or hop limit expires. The job of ensuring correctness of 
the routing table is given to one or more routing protocols. Many different routing 
protocols are available to do this job, including RIP, OSPF, BGP, and IS-IS, to name a 
few (see, for example, [DC05] for more detail on routing protocols). 

5.4.2 IP Forwarding Actions 

When the IP layer in a host or router needs to send an IP datagram to a next-hop 
router or host, it first examines the destination IP address (D) in the datagram. 
Using the value D, the following longest prefix match algorithm is executed on the 
forwarding table: 

1. Search the table for all entries for which the following property holds: 
(D ∧ m_j) = d_j, where m_j is the value of the mask field associated with the 
forwarding entry e_j having index j, and d_j is the value of the destination field 
associated with e_j. This means that the destination IP address D is bitwise 
ANDed with the mask in each forwarding table entry (m_j), and the result is 
compared against the destination in the same forwarding table entry (d_j). 
If the property holds, the entry (e_j here) is a "match" for the destination IP 
address. When a match happens, the algorithm notes the entry index (j 
here) and how many bits in the mask m_j were set to 1. The more bits set to 
1, the "better" the match. 

2. The best matching entry e_k (i.e., the one with the largest number of 1 bits in 
its mask m_k) is selected, and its next-hop field n_k is used as the next-hop IP 
address in forwarding the datagram. 

If no matches in the forwarding table are found, the datagram is undeliverable. 
If the undeliverable datagram was generated locally (on this host), a "host 
unreachable" error is normally returned to the application that generated the datagram. On 
a router, an ICMP message is normally sent back to the host that sent the datagram. 
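
The two steps of the longest prefix match can be sketched in Python as follows. This is a minimal linear-scan illustration, not how production stacks perform lookups (they typically use tries or hardware tables). The names Route, ip, and lookup are hypothetical, and the sample table reproduces the two entries of Table 5-8 from the direct delivery example in Section 5.4.3.1.

```python
import ipaddress
from typing import NamedTuple, Optional, Sequence


class Route(NamedTuple):
    destination: int  # d_j
    mask: int         # m_j
    next_hop: str     # n_j
    interface: str


def ip(text: str) -> int:
    """Convert dotted-decimal IPv4 text to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


def lookup(table: Sequence[Route], dest: str) -> Optional[Route]:
    """Return the matching entry with the most 1 bits in its mask,
    or None if the datagram is undeliverable."""
    d = ip(dest)
    best, best_bits = None, -1
    for entry in table:
        if (d & entry.mask) == entry.destination:   # step 1: a match
            bits = bin(entry.mask).count("1")
            if bits > best_bits:                    # step 2: keep the best
                best, best_bits = entry, bits
    return best


# The two entries of Table 5-8 (host S, 10.0.0.100/25).
TABLE = [
    Route(ip("0.0.0.0"), ip("0.0.0.0"), "10.0.0.1", "10.0.0.100"),
    Route(ip("10.0.0.0"), ip("255.255.255.128"), "10.0.0.100", "10.0.0.100"),
]

print(lookup(TABLE, "10.0.0.9").next_hop)      # matched by the /25 entry
print(lookup(TABLE, "192.48.96.9").next_hop)   # falls back to the default route
```

Because the comparison is strict, the first of several equally good entries wins, which corresponds to the simplest tie-breaking behavior a system might choose.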

In some circumstances, more than one entry may match an equal number of 1 
bits. This can happen, for example, when more than one default route is available 
(e.g., when attached to more than one ISP, called multihoming). The end-system 
behavior in such cases is not set by standards and is instead specific to the 
operating system's protocol implementation. A common behavior is for the system to 
simply choose the first match. More sophisticated systems may attempt to load-balance 
or split traffic across the multiple routes. Studies suggest that multihoming can be 
beneficial not only for large enterprises, but also for residential users [THL06]. 

5.4.3 Examples 

To get a solid understanding of how IP forwarding works both in the simple local 
environment (e.g., same LAN) and in the somewhat more complicated multihop 
(global Internet) environment, we look at two cases. The first case, where all 
systems are using the same network prefix, is called direct delivery, and the other case 
is called indirect delivery (see Figure 5-16). 

5.4.3.1 Direct Delivery 

First consider a simple example. Our Windows XP host (with IPv4 address S and 
MAC address S), which we will just call S, has an IP datagram to send to our Linux 
host (IPv4 address D, MAC address D), which we will call D. These systems are 
interconnected using a switch. Both hosts are on the same Ethernet (see inside 
front cover). Figure 5-16 (top) shows the delivery of the datagram. When the IP 
layer in S receives a datagram to send from one of the upper layers such as TCP or 
UDP, it searches its forwarding table. We would expect the forwarding table on S 
to contain the information shown in Table 5-8. 

Figure 5-16 Direct delivery does not require the presence of a router; IP datagrams are 
encapsulated in a link-layer frame that directly identifies the source and destination. Indirect 
delivery involves a router; data is forwarded to the router using the router's link-layer 
address as the destination link-layer address. The router's IP address does not appear 
in the IP datagram (unless the router itself is the source or destination, or when source 
routing is used). 


In Table 5-8, the destination IPv4 address D (10.0.0.9) matches both the first 
and second forwarding table entries. Because it matches the second entry 
better (25 bits instead of none), the "gateway" or next-hop address is 10.0.0.100, the 
address S. Thus, the gateway portion of the entry contains the address of the 
sending host's own network interface (no router is referenced), indicating that direct 
delivery is to be used to send the datagram. 


Table 5-8 The (unicast) IPv4 forwarding table at host S contains only two entries. Host S is 
configured with IPv4 address and subnet mask 10.0.0.100/25. Datagrams destined for addresses 
in the range 10.0.0.1 through 10.0.0.126 use the second forwarding table entry and are sent 
using direct delivery. All other datagrams use the first entry and are given to router R 
with IPv4 address 10.0.0.1. 

Destination   Mask              Gateway (Next Hop)   Interface 
0.0.0.0       0.0.0.0           10.0.0.1             10.0.0.100 
10.0.0.0      255.255.255.128   10.0.0.100           10.0.0.100 

The datagram is encapsulated in a lower-layer frame destined for the target 
host D. If the lower-layer address of the target host is unknown, the ARP protocol 
(for IPv4; see Chapter 4) or Neighbor Solicitation (for IPv6; see Chapter 8) 
operation may be invoked at this point to determine the correct lower-layer address, D. 
Once known, the destination address in the datagram is D's IPv4 address (10.0.0.9), 
and D's lower-layer address is placed in the destination address field of the 
lower-layer header. The switch delivers the frame to D based solely on the link-layer 
address D; it pays no attention to the IP addresses. 
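
The choice of which link-layer address goes into the frame can be sketched as follows. The sketch assumes the convention used in Table 5-8, where a gateway equal to the sender's own interface address signals direct delivery. The function names and the neighbor cache contents are hypothetical, and the MAC addresses are taken from a range reserved for documentation.

```python
def resolution_target(dest_ip, next_hop, interface_ip):
    """Return the IP address whose link-layer address must be resolved."""
    if next_hop == interface_ip:
        return dest_ip      # direct delivery: resolve the destination itself
    return next_hop         # indirect delivery: resolve the next-hop router


def frame_destination(dest_ip, next_hop, interface_ip, neighbor_cache):
    """Look up the link-layer destination for the outgoing frame.
    None means ARP (or Neighbor Solicitation) must be performed first."""
    return neighbor_cache.get(resolution_target(dest_ip, next_hop, interface_ip))


# Hypothetical neighbor cache on host S (10.0.0.100).
CACHE = {
    "10.0.0.9": "00:00:5e:00:53:09",   # host D
    "10.0.0.1": "00:00:5e:00:53:01",   # router, a-side interface
}

# Direct delivery to D, then indirect delivery via the default route.
print(frame_destination("10.0.0.9", "10.0.0.100", "10.0.0.100", CACHE))
print(frame_destination("192.48.96.9", "10.0.0.1", "10.0.0.100", CACHE))
```

In both cases the destination IP address in the datagram is unchanged; only the link-layer destination differs.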

5.4.3.2 Indirect Delivery 

Now consider another example. Our Windows host has an IP datagram to send 
to the host ftp.uu.net, whose IPv4 address is 192.48.96.9. Figure 5-16 (bottom) 
shows the conceptual path of the datagram through four routers. First, the 
Windows machine searches its forwarding table but does not find a matching prefix 
on the local network. It uses its default route entry (which matches every 
destination, but with no 1 bits at all). The default entry indicates that the appropriate 
next-hop gateway is 10.0.0.1 (the "a side" of the router R1). This is a typical scenario for 
a home network. 

Recall that in the direct delivery case, the source and destination IP addresses 
correspond to those associated with the source and destination hosts. The same 
is true for the lower-layer (e.g., Ethernet) addresses. In indirect delivery, the 
IP addresses correspond to the source and destination hosts as before, but the 
lower-layer addresses do not. Instead, the lower-layer addresses determine which 
machines receive the frame containing the datagram on a per-hop basis. In this 
example, the lower-layer address needed is the Ethernet address of the next-hop 
router R1's a-side interface, the lower-layer address corresponding to IPv4 address 
10.0.0.1. This is accomplished by ARP (or a Neighbor Solicitation request if this 
example were using IPv6) on the network interconnecting S and R1. Once R1 
responds with its a-side lower-layer address, S sends the datagram to R1. Delivery 
from S to R1 takes place based on processing only the lower-layer headers (more 
specifically, the lower-layer destination address). Upon receipt of the datagram, R1 
checks its forwarding table. The information in Table 5-9 would be typical. 


Table 5-9 The forwarding table at R1 indicates that address translation should be performed for 
traffic. The router has a private address on one side (10.0.0.1) and a public address on the 
other (70.231.132.85). Address translation is used to make datagrams originating on the 
10.0.0.0/25 network appear to the Internet as though they had been sent from 70.231.132.85. 


Destination   Mask              Gateway (Next Hop)   Interface       Note 
0.0.0.0       0.0.0.0           70.231.159.254       70.231.132.85   NAT 
10.0.0.0      255.255.255.128   10.0.0.100           10.0.0.1        NAT 

When R1 receives the datagram, it realizes that the datagram's destination IP 
address is not one of its own, so it forwards the datagram. Its forwarding table is 
searched and the default entry is used. The default entry in this case has a next 
hop within the ISP servicing the network, 70.231.159.254 (this is R2's a-side 
interface). This address happens to be within SBC's DSL network called by the 
somewhat cumbersome name adsl-70-231-159-254.dsl.snfc21.sbcglobal.net. 
Because this router is in the global Internet and the Windows machine's source 
address is the private address 10.0.0.100, R1 performs Network Address Translation 
(NAT) on the datagram to make it routable on the Internet. The NAT operation 
results in the datagram having the new source address 70.231.132.85, which 
corresponds to R1's b-side interface. Networks that do not use private addressing (e.g., 
ISPs and larger enterprises) avoid the last step and the original source address 
remains unchanged. NAT is described in more detail in Chapter 7. 
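
The source address rewrite performed by R1 can be illustrated with a deliberately simplified sketch. A real NAT also rewrites port numbers and checksums and keeps per-connection state so that replies can be translated back (see Chapter 7); none of that is modeled here. The Datagram class and nat_outbound function are hypothetical names introduced only for this example.

```python
import ipaddress
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Datagram:
    src: str
    dst: str


def nat_outbound(dgram: Datagram, public_addr: str) -> Datagram:
    """Rewrite a private source address to the router's public address.
    Datagrams that already carry a public source pass through unchanged."""
    if ipaddress.ip_address(dgram.src).is_private:
        return replace(dgram, src=public_addr)
    return dgram


# The datagram from S as it leaves R1's b-side interface.
outgoing = nat_outbound(Datagram("10.0.0.100", "192.48.96.9"), "70.231.132.85")
print(outgoing)
```

Note that only the source address changes; the destination address, on which forwarding decisions are based, is left alone.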

When router R2 (inside the ISP) receives the datagram, it goes through the 
same steps that the local router R1 did (except for the NAT operation). If the 
datagram is not destined for one of its own IP addresses, the datagram is forwarded. 
In this case, the router usually has not only a default route but several others, 
depending on its connectivity to the rest of the Internet and its own local policies. 

Note that IPv6 forwarding varies only slightly from conventional IPv4 
forwarding. Aside from the larger addresses, IPv6 uses a slightly different 
mechanism (Neighbor Solicitation messages) to ascertain the lower-layer address of its 
next hop. It is described in more detail in Chapter 8, as it is part of ICMPv6. In 
addition, IPv6 has both link-local addresses and global addresses (see Chapter 2). 
While global addresses behave like regular IP addresses, link-local addresses can 
be used only on the same link. In addition, because all the link-local addresses 
share the same IPv6 prefix (fe80::/10), a multihomed host may require user input 
to determine which interface to use when sending a datagram destined for a 
link-local destination. 

To illustrate the use of link-local addresses, we start with our Windows XP 
machine, assuming IPv6 is enabled and operational: 

C:\> ping6 fe80::204:5aff:fe9f:9e80 

Pinging fe80::204:5aff:fe9f:9e80 with 32 bytes of data: 

No route to destination. 
  Specify correct scope-id or use -s to specify source address. 

C:\> ping6 fe80::204:5aff:fe9f:9e80%6 

Pinging fe80::204:5aff:fe9f:9e80%6 
from fe80::205:4eff:fe4a:24bb%6 with 32 bytes of data: 

Reply from fe80::204:5aff:fe9f:9e80%6: bytes=32 time=1ms 
Reply from fe80::204:5aff:fe9f:9e80%6: bytes=32 time=1ms 
Reply from fe80::204:5aff:fe9f:9e80%6: bytes=32 time=1ms 
Reply from fe80::204:5aff:fe9f:9e80%6: bytes=32 time=1ms 

Ping statistics for fe80::204:5aff:fe9f:9e80%6: 
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss), 
Approximate round trip times in milli-seconds: 
    Minimum = 1ms, Maximum = 1ms, Average = 1ms 


Here we see that failing to specify which interface to use for outbound 
link-local traffic results in an error. In Windows XP, we can specify either a scope ID or 
a source address. In this example we specify the scope ID as an interface number 
using the %6 extension to the destination address. This informs the system to use 
interface number 6 as the correct interface when sending the ping traffic. 

To see the path taken to an IP destination, we can use the traceroute 
program (called tracert on Windows, which has a slightly different set of options) 
with the -n option to not convert IP addresses to names: 
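
The role of the scope ID can be illustrated with a short Python sketch that separates the %-suffix from the address and reports when a link-local destination lacks one. The helper names are hypothetical; only the standard ipaddress module is used, and the scope suffix is stripped before parsing so the code does not depend on any particular Python version's handling of scoped addresses.

```python
import ipaddress


def split_scope(text):
    """Split 'address%scope' into (address, scope); scope is None if absent."""
    addr, _, scope = text.partition("%")
    return addr, (scope or None)


def missing_scope(text):
    """True if the destination is link-local but names no interface."""
    addr, scope = split_scope(text)
    return ipaddress.IPv6Address(addr).is_link_local and scope is None


# The two destinations used in the ping6 example above.
print(missing_scope("fe80::204:5aff:fe9f:9e80"))     # ambiguous on a multihomed host
print(missing_scope("fe80::204:5aff:fe9f:9e80%6"))   # interface 6 named explicitly
```

A global address never needs a scope ID in this sense, because the forwarding table alone determines the outgoing interface.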


Linux% traceroute -n ftp.uu.net 
traceroute to ftp.uu.net (192.48.96.9), 30 hops max, 38 byte packets 
 1  70.231.159.254   9.285 ms   8.404 ms   8.887 ms 
 2  206.171.134.131  8.412 ms   8.764 ms   8.661 ms 
 3  216.102.176.226  8.502 ms   8.995 ms   8.644 ms 
 4  151.164.190.185  8.705 ms   8.673 ms   9.014 ms 
 5  151.164.92.181   9.149 ms   9.057 ms   9.537 ms 
 6  151.164.240.134  9.680 ms  10.389 ms  11.003 ms 
 7  151.164.41.10   11.605 ms  37.699 ms  11.374 ms 
 8  12.122.79.97    13.449 ms  12.804 ms  13.126 ms 
 9  12.122.85.134   15.114 ms  15.020 ms  13.654 ms 
      MPLS Label=32307 CoS=5 TTL=1 S=0 
10  12.123.12.18    16.011 ms  13.555 ms  13.167 ms 
11  192.205.33.198  15.594 ms  15.497 ms  16.093 ms 
12  152.63.57.102   15.103 ms  14.769 ms  15.128 ms 
13  152.63.34.133   77.501 ms  77.593 ms  76.974 ms 
14  152.63.38.1     77.906 ms  78.101 ms  78.398 ms 
15  207.18.173.162  81.146 ms  81.281 ms  80.918 ms 
16  198.5.240.36    77.988 ms  78.007 ms  77.947 ms 
17  198.5.241.101   81.912 ms  82.231 ms  83.115 ms 

This program lists each of the IP hops traversed while sending a series of 
datagrams to the destination ftp.uu.net (192.48.96.9). The traceroute program 
uses a combination of UDP datagrams (with increasing TTL over time) and ICMP 
messages (used to detect each hop when the UDP datagrams expire) to accomplish 
its task. Three UDP packets are sent at each TTL value, providing three 
round-trip-time measurements to each hop. Traditionally, traceroute has carried only 
IP information, but here we also see the following line: 

MPLS Label=32307 CoS=5 TTL=1 S=0 

This indicates that Multiprotocol Label Switching (MPLS) [RFC3031] is being used 
on the path, and the label ID is 32307, class of service is 5, TTL is 1, and the message 
is not the bottom of the MPLS label stack (S = 0; see [RFC4950]). MPLS is a form 
of link-layer network capable of carrying multiple network-layer protocols. Its 
interaction with ICMP is described in [RFC4950], and its handling of IPv4 packets 
containing options is described in [RFC6178]. Many network operators use it for 
traffic engineering purposes (i.e., controlling where network traffic flows through 
their networks). 
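
The four values reported by traceroute correspond to the fields of a single 32-bit MPLS label stack entry. The sketch below assumes the standard layout from RFC 3032, which is not cited in the text above: a 20-bit label, a 3-bit class-of-service (traffic class) field, a 1-bit bottom-of-stack flag S, and an 8-bit TTL. The function names are hypothetical.

```python
def pack_entry(label, cos, s, ttl):
    """Pack one MPLS label stack entry into a 32-bit integer
    (assumed layout: label:20 | cos:3 | s:1 | ttl:8)."""
    return (label << 12) | (cos << 9) | (s << 8) | ttl


def unpack_entry(word):
    """Split a 32-bit label stack entry back into its four fields."""
    return {
        "label": word >> 12,
        "cos": (word >> 9) & 0x7,
        "s": (word >> 8) & 0x1,
        "ttl": word & 0xFF,
    }


# The entry reported at hop 9 of the traceroute output.
entry = pack_entry(label=32307, cos=5, s=0, ttl=1)
print(unpack_entry(entry))
```

With S = 0, at least one more label stack entry follows this one before the encapsulated packet begins.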

5.4.4 Discussion 

In the examples we have just seen there are a few key points that should be kept in 
mind regarding the operation of IP unicast forwarding: 

1. Most of the hosts and routers in this example used a default route 
consisting of a single forwarding table entry of this form: mask 0, destination 0, 
next hop <some IP address>. Indeed, most hosts and most routers at the 
edge of the Internet can use a default route for everything other than 
destinations on local networks because there is only one interface available that 
provides connectivity to the rest of the Internet. 

2. The source and destination IP addresses in the datagram never change 
once in the regular Internet. This is always the case unless either source 
routing is used, or when other functions (such as NAT, as in the example) 
are encountered along the data path. Forwarding decisions at the IP layer 
are based on the destination address. 

3. A different lower-layer header is used on each link that uses addressing, 
and the lower-layer destination address (if present) always contains the 
lower-layer address of the next hop. Therefore, lower-layer headers 
routinely change as the datagram is moved along each hop toward its 
destination. In our example, both Ethernet LANs encapsulated a link-layer 
header containing the next hop's Ethernet address, but the DSL link did 
not. Lower-layer addresses are normally obtained using ARP (see Chapter 
4) for IPv4 and ICMPv6 Neighbor Discovery for IPv6 (see Chapter 8). 


5.5 Mobile IP 

So far we have discussed the conventional ways that IP datagrams are forwarded 
through the Internet, as well as private networks that use IP. One assumption of the 
model is that a host's IP address shares a prefix with its nearby hosts and routers. If 
such a host should move its point of network attachment, yet remain connected to 
the network at the link layer, all of its upper-layer (e.g., TCP) connections would fail 
because either its IP address would have to be changed or routing would not deliver 
packets to the (moved) host properly. A multiyear (actually, multidecade!) effort 
known as Mobile IP addresses this issue. (Other protocols have also been suggested; 
see [RFC6301].) Although there are versions of Mobile IP for both IPv4 [RFC5944] 
(MIPv4) and IPv6 [RFC6275], we focus on Mobile IPv6 (called MIPv6) because it is 
more flexible and somewhat easier to explain. Also, it currently appears more likely 
to be deployed in the quickly growing smartphone market. Note that we do not 
discuss MIPv6 comprehensively; it is sufficiently complex to merit a book on its own 
(e.g., [RC05]). Nonetheless, we will cover its basic concepts and principles. 

Mobile IP is based on the idea that a host has a "home" network but may visit 
other networks from time to time. While at home, ordinary forwarding is 
performed, according to the algorithms discussed in this chapter. When away from 
home, the host keeps the IP address it would ordinarily use at home, but some 
special routing and forwarding tricks are used to make the host appear to the 
network, and to the other systems with which it communicates, as though it is 
attached to its home network. The scheme depends on a special type of router 
called a "home agent" that helps provide routing for mobile nodes. 

Most of the complexity in MIPv6 involves signaling messages and how they 
are secured. These messages use various forms of the Mobility extension header 
(Next Header field value 135 in Table 5-5, often just called the mobility header), so 
Mobile IP is, in effect, a special protocol of its own. The IANA maintains a registry 
of the various header types (17 are reserved currently), along with many other 
parameters associated with MIPv6 [MP]. We shall focus on the basic messages 
specified in [RFC6275]. Other messages are used to implement "fast handovers" 
[RFC5568], changing of the home agent [RFC5142], and experiments [RFC5096]. 
To understand MIPv6, we begin by introducing the basic model for IP mobility 
and the associated terminology. 

5.5.1 The Basic Model: Bidirectional Tunneling 

Figure 5-17 shows the entities involved in making MIPv6 work. Much of the 
terminology also applies to MIPv4 [RFC5944]. A host that might move is called a mobile 
node (MN), and the hosts with which it is communicating are called 
correspondent nodes (CNs). The MN is given an IP address chosen from the network prefix 
used in its home network. This address is known as its home address (HoA). When 
it travels to a visited network, it is given an additional address, called its care-of 
address (CoA). In the basic model, whenever a CN communicates with an MN, 
the traffic is routed through the MN's home agent (HA). HAs are a special type of 
router deployed in the network infrastructure like other important systems (e.g., 
routers and Web servers). The association between an MN's HoA and its CoA is 
called a binding for the MN. 

The basic model (see Figure 5-17) works in cases where an MN's CNs do not 
engage in the MIPv6 protocol. This model is also used for network mobility (called 
"NEMO" [RFC3963]), when an entire network is mobile. When the MN (or mobile 
network router) attaches to a new point in the network, it receives its CoA and 
sends a binding update message to its HA. The HA responds with a binding 
acknowledgment. Assuming that all goes well, traffic between the MN and CNs is 
thereafter routed through the MN's HA using a two-way form of IPv6 packet tunneling 
[RFC2473] called bidirectional tunneling. These messages are ordinarily protected 
using IPsec with the Encapsulating Security Payload (ESP) (see Chapter 18). Doing so 
ensures that an HA is not fooled into accepting a binding update from a fake MN. 

Figure 5-17 Mobile IP supports the ability of nodes to change their point of network attachment and keep 
network connections operating. The mobile node's home agent helps to forward traffic for mobiles 
it serves and also plays a role in route optimization, which can substantially improve routing 
performance by allowing mobile and correspondent nodes to communicate directly. 
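
The home agent's part in the basic model can be pictured as a small table mapping each home address to its current care-of address. The following sketch is purely conceptual: the HomeAgent class and its methods are hypothetical, the addresses come from the IPv6 documentation prefix, and the IPsec protection of binding updates is not modeled at all.

```python
class HomeAgent:
    """Toy model of a home agent's binding cache (HoA -> CoA)."""

    def __init__(self):
        self.bindings = {}

    def binding_update(self, hoa, coa):
        """Record the MN's current care-of address and acknowledge it."""
        self.bindings[hoa] = coa
        return ("binding-acknowledgment", hoa, coa)

    def tunnel_endpoint(self, dst):
        """Where traffic addressed to dst should be sent: the CoA if the
        MN is away from home, otherwise the address itself."""
        return self.bindings.get(dst, dst)


ha = HomeAgent()
HOA = "2001:db8:1::10"   # hypothetical home address
COA = "2001:db8:2::10"   # hypothetical care-of address in a visited network

print(ha.tunnel_endpoint(HOA))     # MN at home: ordinary forwarding
print(ha.binding_update(HOA, COA))
print(ha.tunnel_endpoint(HOA))     # MN away: tunnel toward the CoA
```

Correspondent nodes keep sending to the home address throughout, which is why they need no knowledge of MIPv6 in the basic model.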

5.5.2 Route Optimization (RO) 

Bidirectional tunneling makes MIPv6 work in a relatively simple way, and with 
CNs that are not Mobile-IP-aware, but the routing can be extremely inefficient, 
especially if the MN and CNs are near each other but far away from the MN's HA. 
To improve upon the inefficient routing that may occur in basic MIPv6, a process 
called route optimization (RO) can be used, provided it is supported by the various 
nodes involved. As we shall see, the methods used to ensure that RO is secure and 
useful are somewhat complicated. We shall sketch only its basic operations. For a 
more detailed discussion, see [RFC6275] and [RFC4866]. For a discussion of the 
design rationale behind RO security, see [RFC4225]. 

When used, RO involves a correspondent registration whereby an MN notifies 
its CNs of its current CoA to allow routing to take place without help from the HA. 
RO operates in two parts: one part involves establishing and maintaining the 
registration bindings; another involves the method used to exchange datagrams once 
all bindings are in place. To establish a binding with its CNs, an MN must prove 
to each CN that it is the proper MN. This is accomplished by a Return Routability 
Procedure (RRP). The messages that support RRP are not protected using IPsec as 
are the messages between an MN and its HA. Expecting IPsec to work between 
an MN and any CN was believed to be too unreliable (IPv6 requires IPsec 
support but does not require its use). Although the RRP is not as strong as IPsec, it 
is simpler and covers most of the security threats of concern to the designers of 
Mobile IP. 

The RRP uses the following mobility messages, all of which are subtypes of the 
IPv6 Mobility extension header: Home Test Init (HoTI), Home Test (HoT), Care-of 
Test Init (CoTI), Care-of Test (CoT). These messages verify to a CN that a 
particular MN is reachable both at its home address (HoTI and HoT messages) and at its 
care-of addresses (CoTI and CoT messages). The protocol is shown in Figure 5-18. 



Figure 5-18 The return routability check procedure used in sending binding updates from an MN 
to a CN in order to enable route optimization. The check aims to demonstrate to a CN 
that an MN is reachable at both its home address and its care-of address. In this figure, 
messages routed indirectly are indicated with dashed arrows. The numbers indicate 
the ordering of messages, although the HoTI and CoTI messages can be sent by an MN 
in parallel. 

To understand the RRP, we take the simplest case of a single MN, its HA, and 
a CN as shown in Figure 5-18. The MN begins by sending both a HoTI and CoTI 
message to the CN. The HoTI message is forwarded through the HA on its way 
to the CN. The CN receives both messages in some order and responds with a 
HoT and CoT message to each, respectively. The HoT message is sent to the MN 
via the HA. Inside these messages are random bit strings called tokens, which the 
MN uses to form a cryptographic key (see Chapter 18 for a discussion of the basics 
of cryptography and keys). The key is then used to form authenticated binding 
updates that are sent to the CN. If successful, the route can be optimized and data 
can flow directly between an MN and a CN, as shown in Figure 5-19. 
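
How two tokens become a key that authenticates a binding update can be sketched as follows. The sketch is loosely modeled on the construction in [RFC6275], where the key is a SHA-1 hash over the concatenated tokens and the authenticator is the first 96 bits of an HMAC-SHA1; nonce indices, the actual message layout, and token generation at the CN are all omitted. The function names, token values, and addresses are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def binding_key(home_token: bytes, careof_token: bytes) -> bytes:
    """Derive the binding key from the tokens carried in HoT and CoT."""
    return hashlib.sha1(home_token + careof_token).digest()


def binding_authenticator(key: bytes, coa: str, cn: str, update: bytes) -> bytes:
    """Authenticate a binding update: first 96 bits of HMAC-SHA1 over the
    care-of address, the CN address, and the update contents."""
    data = (ipaddress.IPv6Address(coa).packed
            + ipaddress.IPv6Address(cn).packed
            + update)
    return hmac.new(key, data, hashlib.sha1).digest()[:12]


# Hypothetical 64-bit tokens received by the MN via the two paths.
HOME_TOKEN = bytes.fromhex("0102030405060708")
CAREOF_TOKEN = bytes.fromhex("1112131415161718")

key = binding_key(HOME_TOKEN, CAREOF_TOKEN)
auth = binding_authenticator(key, "2001:db8:2::10", "2001:db8:3::1", b"binding-update")
print(auth.hex())
```

Because the key depends on both tokens, a node that can receive traffic at only one of the two addresses cannot produce a valid authenticator, which is the point of testing both paths.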



Figure 5-19 Once a binding is established between an MN and a CN, data flows directly between 
them. The direction from MN to CN uses an IPv6 Home Address Destination option. 
The reverse direction uses a type 2 Routing header (RH2). 

Once a binding has been established successfully, data may flow directly 
between an MN and its CNs without the inefficiency of bidirectional tunneling. 
This is accomplished using an IPv6 Destination option for traffic moving from 
the MN to a CN and a type 2 Routing header (RH2) for traffic headed in the 
reverse direction, as detailed in Figure 5-19. The packets from MN to CN include 
a Source IP Address field of the MN's CoA, which avoids problems associated with 
ingress filtering [RFC2827] that might cause packets containing the MN's HoA in 
the Source IP Address field to be dropped. The MN's HoA, contained in the Home 
Address option, is not processed by routers, so it passes through to the CN 
without modification. On the return path, packets are destined for the MN's CoA. After 
successfully receiving a returning packet, the MN processes the extension headers 
and replaces the destination IP address with the HoA contained in the RH2. The 
resulting packet is delivered to the rest of the MN's protocol stack, so applications 
"believe" they are using the MN's HoA instead of its CoA for establishing 
connections and other actions. 

5.5.3 Discussion 

There are a number of issues with Mobile IP. It is designed to address a certain 
type of mobility in which a node's IP address may change while the underlying 
link layer remains more or less connected. This type of usage is not common for 
portable computers, which tend to shut down or be put to sleep when being moved 
from place to place. The usage model requiring Mobile IP (and MIPv6 in 
particular) is more likely to be a large number of smartphones that use IP. Such devices 
may be running real-time applications (e.g., VoIP) that have latency requirements. 
Consequently, several approaches are being explored to reduce the amount of time 
required to execute binding updates. These include fast handovers [RFC5568], a 
modification to MIPv6 called Hierarchical MIPv6 (HMIPv6) [RFC5380], and a 
modification in which the mobile signaling ordinarily required of an MN is 
performed by a proxy (called proxy MIPv6 or PMIPv6 [RFC5213]). 

5.6 Host Processing of IP Datagrams 

Although routers do not ordinarily have to consider which IP addresses to place 
in the Source IP Address and Destination IP Address fields of the packets they 
forward, hosts must consider both. Applications such as Web browsers may attempt 
to make connections to a named host or server that can have multiple addresses. 
The client system making such connections may also have multiple addresses. 
Thus, there is some question as to which address (and version of IP) should be 
used when sending a datagram. A more subtle point we shall explore is whether 
to accept traffic destined for a local IP address if it arrives on the wrong interface 
(i.e., one that is not configured with the destination address present in a received 
datagram). 

5.6.1 Host Models 

Although it may appear to be a straightforward decision to determine whether a 
received unicast datagram matches one of a host's IP addresses and should be 
processed, this decision depends on the host model of the receiving system [RFC1122] 
and is most relevant for multihomed hosts. There are two host models, the strong 
host model and the weak host model. In the strong host model, a datagram is accepted 
for delivery to the local protocol stack only if the IP address contained in the 
Destination IP Address field matches one of those configured on the interface upon which 
the datagram arrived. In systems implementing the weak host model, the 
opposite is true: a datagram carrying a destination address matching any of the local 
addresses may arrive on any interface and is processed by the receiving protocol 
stack, irrespective of the network interface upon which it arrived. Host models also 
apply to sending behavior. That is, a host using the strong host model sends 
datagrams from a particular interface only if one of the interface's configured addresses 
matches the Source IP Address field in the datagram being sent. 
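
The receive-side difference between the two host models comes down to which set of addresses the destination is checked against. The sketch below uses a hypothetical accepts function and a hypothetical interface table for a host A with address 192.0.2.1 on its Internet-facing interface and 203.0.113.1 on its local interface; this layout is inferred from the example discussed with Figure 5-20.

```python
def accepts(dst, arrival_if, if_addrs, strong):
    """Decide whether a received unicast datagram is processed locally.

    if_addrs maps an interface name to the set of addresses configured
    on that interface."""
    if strong:
        # Strong host model: the destination must be configured on the
        # interface the datagram actually arrived on.
        return dst in if_addrs.get(arrival_if, set())
    # Weak host model: any local address is acceptable on any interface.
    return any(dst in addrs for addrs in if_addrs.values())


# Hypothetical configuration of host A.
HOST_A = {
    "internet": {"192.0.2.1"},
    "local": {"203.0.113.1"},
}

# A packet for A's Internet address arriving over the local network.
print(accepts("192.0.2.1", "local", HOST_A, strong=True))
print(accepts("192.0.2.1", "local", HOST_A, strong=False))
```

The same datagram is dropped under the strong model and processed under the weak one, which is exactly the situation the text goes on to examine.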





Figure 5-20 illustrates a case where the host model becomes important. In 
this example, two hosts (A and B) are connected through the global Internet but 
also through a local network. If host A is set up to conform to the strong host 
model, packets it receives destined for 203.0.113.1 from the Internet or destined for 
192.0.2.1 from the local network are dropped. This situation can arise, for example, 
if host B is configured to obey the weak host model. It may choose to send 
packets to 192.0.2.1 using the local network (e.g., because doing so may be cheaper or 
faster). This situation seems unfortunate, as A receives what appear to be perfectly 
legitimate packets, yet drops them merely because it is operating according to the 
strong host model. So a reasonable question would be: Why is it ever a good idea 
to use the strong host model? 


Figure 5-20 Hosts may be connected by more than one interface. In such cases, they must decide 
which addresses to use for the Source IP Address and Destination IP Address fields of the 
packets they exchange. The addresses used result from a combination of each host's 
forwarding table, application of an address selection algorithm [RFC3484], and whether 
hosts are operating using a weak or strong host model. 


The attraction of using the strong host model relates to a security concern. 
Referring to Figure 5-20, consider a malicious user on the Internet who injects a 
packet destined for the address 203.0.113.2. This packet could also include a forged 
("spoofed") source IP address (e.g., 203.0.113.1). If the Internet cooperates in 
routing such a packet to B, applications running on B may be tricked into believing 
they have received local traffic originating from A. This can have significant 
negative consequences if such applications make access control decisions based on the 
source IP address. 

The host model, for both sending and receiving behavior, can be configured
in some operating systems. In Windows (Vista and later), strong host behavior is
the default for sending and receiving for IPv4 and IPv6. In Linux, the IP behavior
defaults to the weak host model. BSD (including Mac OS X) uses the strong host
model. In Windows, the following commands can be used to configure weak host
receive and send behavior, respectively:


C:\> netsh interface ipvX set interface <ifname> weakhostreceive=Yabled

C:\> netsh interface ipvX set interface <ifname> weakhostsend=Yabled


For these commands, <ifname> is replaced with the appropriate interface name;
X is replaced with either 4 or 6, depending on which version of IP is being configured;
and Y is replaced with either en or dis, depending on whether weak
behavior is to be enabled or disabled, respectively.
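
As a small illustration of the substitutions just described, the following Python sketch builds the two command lines as strings. It does not run netsh; the function name and the choice to quote the interface name are assumptions made for this example.

```python
def weak_host_commands(ifname, ip_version=4, enable=True):
    """Build the netsh command lines for weak host receive and send behavior."""
    if ip_version not in (4, 6):
        raise ValueError("ip_version must be 4 or 6")
    # Y in the text becomes "en" or "dis", giving "enabled" or "disabled".
    state = "enabled" if enable else "disabled"
    # The interface name is quoted so that names containing spaces stay intact.
    base = f'netsh interface ipv{ip_version} set interface "{ifname}"'
    return [f"{base} weakhostreceive={state}", f"{base} weakhostsend={state}"]

for line in weak_host_commands("Ethernet", ip_version=6, enable=True):
    print(line)
```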

5.6.2 Address Selection 

When a host sends an IP datagram, it must decide which of its IP addresses to
place in the Source IP Address field of the outgoing datagram, and which destination
address to use for a particular destination host if multiple addresses for it are
known. In some cases the source address is already known because it is provided
by an application or because the packet is being sent in response to a previously
received packet on the same connection (see, for example, Chapter 13 for how
addresses are managed with TCP).

In modern IP implementations, the IP addresses used in the Source IP Address
and Destination IP Address fields of the datagram are selected using a set of procedures
called source address selection and destination address selection. Historically,
most Internet hosts had only one IP address for external communication, so selecting
the addresses was not terribly difficult. With the advent of multiple addresses
per interface and the use of IPv6 in which simultaneous use of addresses with
multiple scopes is normal, some procedure must be used. The situation is further
complicated when communication is to take place between two hosts that implement
both IPv4 and IPv6 ("dual-stack" hosts; see [RFC4213]). Failure to select the
correct addresses can lead to asymmetric routing, unwanted filtering, or discarding
of packets. Fixing such problems can be a challenge.

[RFC3484] gives the rules for selecting IPv6 default addresses; IPv4-only hosts
do not ordinarily have such complex issues. In general, applications can invoke
special API operations to override the default behavior, as suggested previously.
Even then, tricky deployment situations may still arise [RFC5220]. The default
rules in [RFC3484] are to prefer source/destination address pairs where the
addresses are of the same scope, to prefer smaller over larger scopes, to avoid the
use of temporary addresses when other addresses are available, and to otherwise
prefer pairs with the longest common prefix. Global addresses are preferred over
temporary addresses when available. The specification also includes a method of
providing "administrative override" to the default rules, but this is deployment-specific
and we do not discuss it further.

The selection of default addresses is controlled by a policy table, present (at
least conceptually) in each host. It is a longest-matching-prefix lookup table, similar
to a forwarding table used with IP routing. For an address A, a lookup in this
table produces a precedence value for A, P(A), and a label for A, L(A). A higher precedence
value indicates greater preference. The labels are used for grouping of
similar address types. For example, if L(S) = L(D), the algorithm prefers to use the
pair (S,D) as a source/destination pair. If no other policy is specified, [RFC3484]
suggests that the policy values from Table 5-10 be used.


Table 5-10 The default host policy table, according to [RFC3484]. Higher precedence values indicate 
a greater preference. 


Prefix            Precedence P()    Label L()

::1/128           50                0
::/0              40                1
2002::/16         30                2
::/96             20                3
::ffff:0:0/96     10                4
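
To make the longest-matching-prefix lookup concrete, here is a minimal Python sketch that uses the standard `ipaddress` module with the rows of Table 5-10. The names `POLICY_TABLE` and `policy_lookup` are chosen for this example only; a real host would consult its own configured table.

```python
import ipaddress

# (prefix, precedence, label) rows taken from Table 5-10.
POLICY_TABLE = [
    (ipaddress.ip_network("::1/128"), 50, 0),
    (ipaddress.ip_network("::/0"), 40, 1),
    (ipaddress.ip_network("2002::/16"), 30, 2),
    (ipaddress.ip_network("::/96"), 20, 3),
    (ipaddress.ip_network("::ffff:0:0/96"), 10, 4),
]

def policy_lookup(address):
    """Return (P(A), L(A)) for an IPv6 address using the longest matching prefix."""
    addr = ipaddress.IPv6Address(address)
    # ::/0 matches every address, so the list of matches is never empty.
    matches = [row for row in POLICY_TABLE if addr in row[0]]
    _, precedence, label = max(matches, key=lambda row: row[0].prefixlen)
    return precedence, label

print(policy_lookup("2001:db8::1"))
print(policy_lookup("::ffff:192.0.2.1"))
```

Note that the loopback address ::1 also falls inside ::/96 and ::/0, but the /128 entry wins because it is the longest match.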


This table, or one configured at a site based upon administrative configuration
parameters, is used to drive the address selection algorithm. The function
CPL(A,B) or "common prefix length" is the length, in bits, of the longest common
prefix between IPv6 addresses A and B, starting from the leftmost (most significant)
bit. The function S(A) is the scope of IPv6 address A mapped to a numeric
value with larger scopes mapping to larger values. If A is link-scoped and B is
global scope, then S(A) < S(B). The function M(A) maps an IPv4 address A to an
IPv4-mapped IPv6 address. Because the scope properties of IPv4 addresses are
based on the value of the address itself, the following relations need to be defined:
S(M(169.254.x.x)) = S(M(127.x.x.x)) < S(M(private address space)) < S(M(any other
address)). The notation Λ(A) is the lifecycle of the address (see Chapter 6). Λ(A) <
Λ(B) if A is a deprecated address (i.e., one whose use is discouraged) and B is a preferred
address (i.e., an address preferred for active use). Finally, H(A) is true if A is
a home address and C(A) is true if A is a care-of address. These last two terms are
used only in the context of Mobile IP.
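
Two of these helper functions are simple enough to sketch directly. The Python below is one possible rendering, assuming addresses are given as text; the names `cpl` and `ipv4_mapped` are chosen for this example.

```python
import ipaddress

def cpl(a, b):
    """CPL(A,B): common prefix length, in bits, of two IPv6 addresses."""
    diff = int(ipaddress.IPv6Address(a)) ^ int(ipaddress.IPv6Address(b))
    # The highest set bit of the XOR marks the first position where A and B differ.
    return 128 - diff.bit_length()

def ipv4_mapped(a):
    """M(A): map an IPv4 address to its IPv4-mapped IPv6 address (::ffff:0:0/96)."""
    return ipaddress.IPv6Address((0xFFFF << 32) | int(ipaddress.IPv4Address(a)))

print(cpl("2001:db8::1", "2001:db8::2"))
print(ipv4_mapped("192.0.2.1"))
```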

5.6.2.1 The Source Address Selection Algorithm 

The source address selection algorithm defines a candidate set CS(D) of potential
source addresses based on a particular destination address D. There is a restriction
that anycast, multicast, and the unspecified address are never in CS(D) for any D.
We shall use the notation R(A) to indicate the rank of address A in the set CS(D).
A higher rank (i.e., greater value of R(A)) for A versus B in CS(D), denoted R(A) >
R(B), means that A is preferred to B for use as a source address for reaching the
machine with address D. The notation R(A) *> R(B) means to assign A a higher
rank than B in CS(D). The notation I(D) indicates the interface selected (i.e., by
the forwarding longest matching prefix algorithm described previously) to reach
destination D. The notation @(i) is the set of addresses assigned to interface i. The
notation T(A) is the Boolean true if A is a temporary address (see Chapter 6) and
false otherwise.

The following rules are applied to establish a partial ordering between
addresses A and B in CS(D) for destination D:

1. Prefer same address: if A = D, R(A) *> R(B); if B = D, R(B) *> R(A).

2. Prefer appropriate scope: if S(A) < S(B), {if S(A) < S(D), R(B) *> R(A) else
R(A) *> R(B)}; if S(B) < S(A), {if S(B) < S(D), R(A) *> R(B) else R(B) *> R(A)}.

3. Avoid deprecated addresses: if S(A) = S(B), {if Λ(A) < Λ(B), R(B) *> R(A) else
R(A) *> R(B)}.

4. Prefer home address: if H(A) and C(A) and ¬(C(B) and H(B)), R(A) *> R(B);
if H(B) and C(B) and ¬(C(A) and H(A)), R(B) *> R(A); if (H(A) and ¬C(A))
and (¬H(B) and C(B)), R(A) *> R(B); if (H(B) and ¬C(B)) and (¬H(A) and
C(A)), R(B) *> R(A).

5. Prefer outgoing interface: if A ∈ @(I(D)) and B ∉ @(I(D)), R(A) *> R(B); if B
∈ @(I(D)) and A ∉ @(I(D)), R(B) *> R(A).

6. Prefer matching label: if L(A) = L(D) and L(B) ≠ L(D), R(A) *> R(B); if L(B) =
L(D) and L(A) ≠ L(D), R(B) *> R(A).

7. Prefer nontemporary addresses: if T(B) and ¬T(A), R(A) *> R(B); if T(A) and
¬T(B), R(B) *> R(A).

8. Use longest matching prefix: if CPL(A,D) > CPL(B,D), R(A) *> R(B); if
CPL(B,D) > CPL(A,D), R(B) *> R(A).

The partial ordering rules can be used to form a total ordering of all the candidate
addresses in CS(D). The one with the largest rank is the selected source
address for destination D, denoted Q(D), and is used by the destination address
selection algorithm. If Q(D) = Ø (null), no source could be determined for destination
D.
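
The flavor of the source selection procedure can be seen in a deliberately reduced Python sketch that applies only rules 1, 2, and 8 (same address, appropriate scope, longest matching prefix). It is not a complete implementation of [RFC3484]: rules 3 through 7 are omitted, the numeric scope values are an assumed coarse mapping, and the function names are invented for this example.

```python
import ipaddress

def scope(addr):
    """S(A): coarse numeric scope; larger values mean wider scope (assumed mapping)."""
    if addr.is_loopback or addr.is_link_local:
        return 2
    if addr.is_site_local:
        return 5
    return 14

def cpl(a, b):
    """CPL(A,B) for two IPv6Address objects."""
    return 128 - (int(a) ^ int(b)).bit_length()

def prefer(a, b, d):
    """Positive if a outranks b as a source for d, negative if b outranks a."""
    if a == d:                        # rule 1: prefer same address
        return 1
    if b == d:
        return -1
    sa, sb, sd = scope(a), scope(b), scope(d)
    if sa < sb:                       # rule 2: prefer appropriate scope
        return -1 if sa < sd else 1
    if sb < sa:
        return 1 if sb < sd else -1
    return cpl(a, d) - cpl(b, d)      # rule 8: longest matching prefix

def select_source(candidates, destination):
    """Return Q(D) for the candidate set, or None if the set is empty."""
    d = ipaddress.IPv6Address(destination)
    best = None
    for c in map(ipaddress.IPv6Address, candidates):
        if best is None or prefer(c, best, d) > 0:
            best = c
    return best

cs = ["fe80::1", "2001:db8:1::10", "2001:db8:2::10"]
print(select_source(cs, "2001:db8:1::99"))
print(select_source(cs, "fe80::99"))
```

For a global destination the link-local candidate is rejected by rule 2 and the tie between the two global candidates is broken by rule 8; for a link-local destination the link-local candidate is kept.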

5.6.2.2 The Destination Address Selection Algorithm

We now turn to the problem of default destination address selection. It is specified
in a way similar to source address selection. Recall that Q(D) is the source address
selected in the preceding example to reach the destination D. Let U(B) be the Boolean
true if destination B is not reachable and E(A) indicate that destination A is
reached using some "encapsulating transport" (e.g., tunneled routing). Using the
same structure as before on pairwise elements A and B of the set SD(S), we have
the following rules:

1. Avoid unusable destinations: if U(B) or Q(B) = Ø, R(A) *> R(B); if U(A) or
Q(A) = Ø, R(B) *> R(A).

2. Prefer matching scope: if S(A) = S(Q(A)) and S(B) ≠ S(Q(B)), R(A) *> R(B); if
S(B) = S(Q(B)) and S(A) ≠ S(Q(A)), R(B) *> R(A).

3. Avoid deprecated addresses: if Λ(Q(A)) < Λ(Q(B)), R(B) *> R(A); if Λ(Q(B))
< Λ(Q(A)), R(A) *> R(B).

4. Prefer home address: if H(Q(A)) and C(Q(A)) and ¬(C(Q(B)) and H(Q(B))),
R(A) *> R(B); if H(Q(B)) and C(Q(B)) and ¬(C(Q(A)) and H(Q(A))), R(B) *>
R(A); if (H(Q(A)) and ¬C(Q(A))) and (¬H(Q(B)) and C(Q(B))), R(A) *> R(B);
if (H(Q(B)) and ¬C(Q(B))) and (¬H(Q(A)) and C(Q(A))), R(B) *> R(A).

5. Prefer matching label: if L(Q(A)) = L(A) and L(Q(B)) ≠ L(B), R(A) *> R(B); if
L(Q(A)) ≠ L(A) and L(Q(B)) = L(B), R(B) *> R(A).

6. Prefer higher precedence: if P(A) > P(B), R(A) *> R(B); if P(A) < P(B), R(B)
*> R(A).

7. Prefer native transport: if E(A) and ¬E(B), R(B) *> R(A); if E(B) and ¬E(A),
R(A) *> R(B).

8. Prefer smaller scope: if S(A) < S(B), R(A) *> R(B) else R(B) *> R(A).

9. Use longest matching prefix: if CPL(A, Q(A)) > CPL(B, Q(B)), R(A) *> R(B);
if CPL(A, Q(A)) < CPL(B, Q(B)), R(B) *> R(A).

10. Otherwise, leave rank order unchanged.
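
As a partial illustration of the rules above, the following Python sketch orders a list of candidate destinations using only rule 1 (avoid unusable destinations), rule 6 (prefer higher precedence), and rule 10 (otherwise keep the existing order). The remaining rules are left out, the reachability information is simply passed in by the caller, and the function names are hypothetical.

```python
import ipaddress

# (prefix, precedence) pairs from Table 5-10.
POLICY = [(ipaddress.ip_network(p), prec) for p, prec in [
    ("::1/128", 50), ("::/0", 40), ("2002::/16", 30),
    ("::/96", 20), ("::ffff:0:0/96", 10),
]]

def precedence(addr):
    """P(A): precedence of the longest matching policy prefix."""
    return max((net.prefixlen, prec) for net, prec in POLICY if addr in net)[1]

def as_ipv6(text):
    """Represent IPv4 addresses as IPv4-mapped IPv6 addresses, M(A)."""
    a = ipaddress.ip_address(text)
    if a.version == 4:
        return ipaddress.IPv6Address((0xFFFF << 32) | int(a))
    return a

def order_destinations(destinations, unreachable=()):
    """Sort candidate destinations, most preferred first."""
    bad = {as_ipv6(u) for u in unreachable}
    def key(text):
        a = as_ipv6(text)
        # Unusable destinations sort last (rule 1); then higher precedence first (rule 6).
        return (a in bad, -precedence(a))
    # sorted() is stable, so ties keep their original order (rule 10).
    return sorted(destinations, key=key)

print(order_destinations(["192.0.2.7", "2001:db8::7"]))
```

With the default policy table, a native IPv6 destination (precedence 40) is ranked ahead of an IPv4 destination (precedence 10 once mapped), which is why dual-stack hosts using these defaults tend to try IPv6 first.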

As with source address selection, these rules form a partial ordering between
two elements of the set of possible destinations SD(S) for
source S. The highest-rank address gives the output for the destination address
selection algorithm. As mentioned previously, some issues have been raised
regarding operation of this algorithm (e.g., step 9 of the destination address selection
can lead to problems with DNS round-robin; see Chapter 11). As a result,
an update to [RFC3484] is being considered [RFC3484-revise]. Importantly, this
revision addresses how so-called Unique Local IPv6 Unicast Addresses (ULAs)
[RFC4193] are treated by the address selection algorithms. ULAs are globally
scoped IPv6 addresses that are constrained to be used only within a common
(private) network.


5.7 Attacks Involving IP

There have been a number of attacks on the IP protocol over the years, based primarily
on the operation of options, or by exploiting bugs in specialized code (such
as fragment reassembly). Simple attacks involve trying to get a router to crash or
perform poorly because one or more of the IP header fields is not valid (e.g., bad
header length or version number). Typically, routers in the Internet today ignore
or strip IP options, and the bugs in basic packet processing have been fixed. Thus,
these types of simple attacks are not a big concern. Attacks involving fragmentation
can be addressed using other means [RFC1858][RFC3128].

Without authentication or encryption (or when it is disabled for IPv6), IP 
spoofing attacks are possible. Some of the earliest attacks involved fabricating 
the source IP address. Because early access control mechanisms depended on the 
source IP address, many such systems were circumvented. Spoofing would sometimes
be combined with various combinations of source routing options. Under
some circumstances, a remote attacker's computer would appear to be a host on 
the local network (or even the same computer) requesting some sort of service. 
Although the spoofing of IP addresses is still a concern today, there are several 
approaches to limit its damage, including ingress filtering [RFC2827][RFC3704], 
whereby an ISP checks the source addresses of its customers' traffic to ensure that 
datagrams contain source addresses from an assigned IP prefix. 
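
The core test behind ingress filtering is a prefix membership check. The Python sketch below shows that check in isolation, assuming the ISP knows the prefixes assigned to a customer; the function name and example prefixes are hypothetical, and real filters are normally implemented in router configuration rather than in application code.

```python
import ipaddress

def ingress_permits(source, assigned_prefixes):
    """Return True if the datagram's source address lies in an assigned prefix."""
    src = ipaddress.ip_address(source)
    # A datagram whose source is outside every assigned prefix would be dropped.
    return any(src in ipaddress.ip_network(p) for p in assigned_prefixes)

customer_prefixes = ["203.0.113.0/24", "2001:db8:100::/48"]
print(ingress_permits("203.0.113.9", customer_prefixes))
print(ingress_permits("198.51.100.1", customer_prefixes))
```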

As IPv6 and Mobile IP are relatively new, at least compared to IPv4, undoubtedly
not all of their vulnerabilities have yet been discovered. With the newer
and more flexible types of options headers, an attacker could have considerable 
influence on the processing of an IPv6 packet. For example, the Routing header 
(type 0) was discovered to have such severe security problems that its use has 
been deprecated entirely. Other possibilities include spoofing the source address 
and/or Routing header entries to make packets appear as if they have come from 
other places. These attacks are avoided by configuring packet-filtering firewalls to 
take into account the contents of Routing headers. It is worth noting that simply 
filtering out all packets containing extension headers and options in IPv6 would 
severely restrict its use. In particular, disabling extension headers would prevent 
Mobile IPv6 from functioning. 


5.8 Summary 

In this chapter we started with a description of the IPv4 and IPv6 headers, discussing
some of the related functions such as the Internet checksum and fragmentation.
We saw how IPv6 increases the size of addresses, improves upon IP's method
of including options in packets by use of the extension headers, and removes several
of the noncritical fields from the IPv4 header. With the addition of this functionality,
the IP header increases in size by only a factor of 2 even though the
size of the addresses has increased fourfold. The IPv4 and IPv6 headers are not
directly compatible and share only the 4-bit Version field in common. Because of
this, some level of translation is required to interconnect IPv4 and IPv6 nodes.
Dual-stack hosts implement both IPv4 and IPv6 but must choose which protocol
to use and when.

Since its inception, IP has included a header field to indicate a type of traffic
or service class associated with each datagram. This mechanism has been redefined
over the years in hopes of providing mechanisms to support differentiated
services on the Internet. If it is widely implemented, the Internet could potentially
offer improved performance for some traffic or users versus others in a standard
way. To what extent this happens will be based in part on working out the business
models surrounding the differentiated services capability.

IP forwarding describes the way IP datagrams are transported through single
and multihop networks. IP forwarding is performed on a hop-by-hop basis unless
special processing takes place. The destination IP address never changes as the
datagram proceeds through all the hops, but the link-layer encapsulation and destination
link-layer address change on each hop. Forwarding tables and the longest
prefix match algorithm are used by hosts and routers to determine the best matching
forwarding entry and determine the next hops along a forwarding path. In
many circumstances, very simple tables consisting of only a default route, which
matches all possible destinations equally, are adequate.

Using a special set of protocols for security and signaling, Mobile IP establishes
secure bindings between a mobile node's home address and care-of address.
These bindings may be used to communicate with a mobile node even when it is
not at home. The basic function involves tunneling traffic through a cooperating
home agent, but this may lead to very inefficient routing. A number of additional
features support a route optimization feature that allows a mobile node to talk
directly with other remote nodes and vice versa. This requires a mobile node's
correspondent hosts to support MIPv6 as well as route optimization, which is an
optional feature. Ongoing work aims at reducing the latency involved in the route
optimization binding update procedure.

We also looked at how the host model, strong or weak, affects how IP datagrams
are processed. In the strong model, each interface is permitted to receive
or send only datagrams that use addresses associated with the interface, whereas
the weak model is less restrictive. The weak host model permits communication
in some cases where it would not otherwise be possible but may be more vulnerable
to certain kinds of attacks. The host model also relates to how a host chooses
which addresses to use when communicating. Early on, most hosts had only one
IP address so the decision was fairly straightforward. With IPv6, in which a host
may have several addresses, and for multihomed hosts using several network
interfaces, the decision is less straightforward yet nonetheless may have an important
impact on routing. A set of address selection algorithms, for both source and
destination addresses, was presented. These algorithms tend to prefer limited-scope,
permanent addresses.

We discussed some of the attacks targeted against the IP protocol. Such
attacks have often involved spoofing addresses, including options to alter routing
behavior, and attempts to exploit bugs in the implementation of IP, especially
with respect to fragmentation. The protocol implementation bugs have been fixed
in modern operating systems, and in most cases options are disabled at the edge
routers of enterprises. Although spoofing remains somewhat of a concern, procedures
such as ingress filtering help to eliminate this problem as well.


System Configuration: DHCP
and Autoconfiguration


6.1 Introduction 

To make use of the TCP/IP protocol suite, each host and router requires a certain 
amount of configuration information. Configuration information is used to assign 
local names to systems, and identifiers (such as IP addresses) to interfaces. It is 
also used to either provide or make use of various network services, such as the 
Domain Name System (DNS) and Mobile IP home agents. Over the years there have 
been many ways of providing and obtaining this information, but fundamentally
there are three approaches: type in the information by hand, have a system
obtain it using a network service, or use some sort of algorithm to automatically 
determine it. We shall explore each of these options and see how they are used 
with both IPv4 and IPv6. Understanding how configuration works is important, 
because it is one of the issues that every system administrator and nearly every 
end user must deal with to some extent. 

Recall from Chapter 2 that every interface to be used with TCP/IP networking
requires an IP address, subnet mask, and broadcast address (for IPv4). The broadcast
address can ordinarily be determined using the address and mask. With this
minimal information, it is generally possible to carry out communication with
other systems on the same subnetwork. To engage in communication beyond the
local subnet, called indirect delivery in Chapter 5, a system requires a routing or
forwarding table that indicates what router(s) are to be used for reaching various
destinations. To be able to use services such as the Web and e-mail, the DNS
(see Chapter 11) is used to map user-friendly domain names to the IP addresses
required by the lower-protocol layers. Because the DNS is a distributed service,
any system making use of it must know how to reach at least one DNS server.
All in all, having an IP address, subnet mask, and the IP address of a DNS server
and router are the "bare essentials" to get a system running on the Internet that
is capable of using or providing popular services such as Web and e-mail. To use
Mobile IP, a system also needs to know how to find a home agent.
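
The remark that the broadcast address can be determined from the address and mask is easy to check with Python's standard `ipaddress` module. The sketch below is illustrative only; the function name and sample addresses are chosen for this example.

```python
import ipaddress

def broadcast_address(address, netmask):
    """Derive the IPv4 subnet broadcast address from an address and its mask."""
    iface = ipaddress.IPv4Interface(f"{address}/{netmask}")
    # Equivalent to setting every host bit to 1: address | ~mask (within 32 bits).
    return iface.network.broadcast_address

print(broadcast_address("192.0.2.10", "255.255.255.0"))
print(broadcast_address("10.1.2.3", "255.255.252.0"))
```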

In this chapter we will focus primarily on the protocols and procedures used 
to establish the bare essentials in Internet client hosts: the Dynamic Host Configuration
Protocol (DHCP) and stateless address autoconfiguration in IPv4 and IPv6. We
will also discuss how some ISPs use PPP with Ethernet for configuration of client
systems. Servers and routers are more often configured by hand, usually by typing
the relevant configuration information into a file or graphical user interface.
There are several reasons for this distinction. First, client hosts are moved around 
more often than servers and routers, meaning they should have mechanisms for 
flexibly reassigning their configuration information. Second, server hosts and 
routers are expected to be "always available" and relatively autonomous. As such, 
having their configuration information not depend on other network services can 
lead to greater confidence in their reliability. Third, there are often far more clients 
in an organization than servers or routers, so it is simpler and less error-prone to 
use a centralized service to dynamically assign configuration information to client
hosts. Fourth, the operators of clients often have less system administration
experience than server and router administrators, so it is once again less error-prone
to have most clients configured by a centralized service administered by an
experienced staff. 

Beyond the bare essentials, there are numerous other bits of configuration 
information a host or router may require, depending on the types of services it 
uses or provides. These may include the locations of home agents, multicast routers,
VPN gateways, and Session Initiation Protocol (SIP)/VoIP gateways. Some of
these services have standardized mechanisms and supporting protocols to obtain 
the relevant configuration information; others do not and instead require the user 
to type in the necessary information. 


6.2 Dynamic Host Configuration Protocol (DHCP)

DHCP [RFC2131] is a popular client/server protocol used to assign configuration
information to hosts (and, less frequently, to routers). DHCP is very widely used,
in both enterprises and home networks. Even the most basic home router devices
support embedded DHCP servers. DHCP clients are incorporated into all common
client operating systems and a large number of embedded devices such as network
printers and VoIP phones. Such devices usually use DHCP to acquire their IP
address, subnet mask, router IP address, and DNS server IP address. Information
pertaining to other services (e.g., SIP servers used with VoIP) may also be conveyed
using DHCP. DHCP was originally conceived for use with IPv4, so references to
it or its relationship with IP in this chapter will refer to IPv4 unless otherwise
specified. IPv6 can also use a version of DHCP called DHCPv6 [RFC3315], which
we discuss in Section 6.2.5, but IPv6 also supports its own automatic processes to
determine configuration information. In a hybrid configuration, IPv6 automatic
configuration can be combined with the use of DHCPv6.

The design of DHCP is based on an earlier protocol called the Internet Bootstrap
Protocol (BOOTP) [RFC0951][RFC1542], which is now effectively obsolete.
BOOTP provides limited configuration information to clients and does not have
a mechanism to support changing that information after it has been provided.
DHCP extends the BOOTP model with the concept of leases [GC89] and can provide
all information required for a host to operate. Leases allow clients to use
the configuration information for an agreed-upon amount of time. A client may 
request to renew the lease and continue operations, subject to agreement from 
the DHCP server. BOOTP and DHCP are backward-compatible in the sense that 
BOOTP-only clients can make use of DHCP servers and DHCP clients can make 
use of BOOTP-only servers. BOOTP, and therefore DHCP as well, is carried using 
UDP/IP (see Chapter 10). Clients use port 68 and servers use port 67. 

DHCP comprises two major parts: address management and delivery of 
configuration data. Address management handles the dynamic allocation of IP 
addresses and provides address leases to clients. Configuration data delivery 
includes the DHCP protocol's message formats and state machines. A DHCP 
server can be configured to provide three levels of address allocation: automatic 
allocation, dynamic allocation, and manual allocation. The differences among the 
three have to do with whether the addresses assigned are based on the identity of 
the client and whether such addresses are subject to being revoked or changed. 
The most commonly used method is dynamic allocation, whereby a client is given 
a revocable IP address from a pool (usually a predefined range) of addresses configured
at the server. In automatic allocation, the same method is used but the
address is never revoked. In manual allocation, the DHCP protocol is used to convey
the address, but the address is fixed for the requesting client (i.e., it is not part
of an allocatable pool maintained by the server). In this last mode, DHCP acts like 
BOOTP. We shall focus on dynamic allocation, as it is the most interesting and 
common case. 

6.2.1 Address Pools and Leases 

In dynamic allocation, a DHCP client requests the allocation of an IP address. 
The server responds with one address selected from a pool of available addresses. 
Typically, the pool is a contiguous range of IP addresses allocated specifically for 
DHCP's use. The address given to the client is allocated for only a specific amount 
of time, called the lease duration. The client is permitted to use the IP address until 
the lease expires, although it may request extension of the lease as required. In 
most situations, clients are able to renew leases they wish to extend. 

The lease duration is an important configuration parameter of a DHCP server.
Lease durations can range from a few minutes to days or more ("infinite" is possible
but not recommended for anything but simple networks). Determining the
best value to use for leases is a trade-off between the number of expected clients,
the size of the address pool, and the desire for the stability of addresses. Longer
lease durations tend to deplete the available address pool faster but provide greater
stability in addresses and somewhat reduced network overhead (because there
are fewer requests to renew leases). Shorter leases tend to keep the pool available
for other clients, with a consequent potential decrease in stability and increase in
network traffic load. Common defaults include 12 to 24 hours, depending on the
particular DHCP server being used. Microsoft, for example, recommends 8 days
for small networks and 16 to 24 days for larger networks. Clients begin trying to
renew leases after half of the lease duration has passed.

When making a DHCP request, a client is able to provide information to the 
server. This information can include the name of the client, its requested lease 
duration, a copy of the address it is already using or last used, and other parameters.
When the server receives such a request, it can make use of whatever information
the client has provided (including the requesting MAC address) in addition
to other exogenous information (e.g., the time of day, the interface on which the 
request was received) to determine what address and configuration information 
to provide in response. In providing a lease to a client, a server stores the lease 
information in persistent memory, typically in nonvolatile memory or on disk. If 
the DHCP server restarts and all goes well, leases are maintained intact. 
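
The interplay of pool, lease duration, and renewal can be sketched with a toy in-memory model in Python. This is not how any real DHCP server is structured: the class name `LeasePool`, its methods, and the use of plain integers for time are all assumptions for illustration, and nothing is written to persistent storage.

```python
import ipaddress

class LeasePool:
    """Toy model of dynamic allocation from a contiguous address range."""

    def __init__(self, first, last, lease_seconds):
        start = int(ipaddress.IPv4Address(first))
        end = int(ipaddress.IPv4Address(last))
        self.free = [ipaddress.IPv4Address(n) for n in range(start, end + 1)]
        self.lease_seconds = lease_seconds
        self.leases = {}  # client id -> (address, expiry time)

    def expire(self, now):
        """Return addresses whose leases have run out to the free pool."""
        for client, (addr, expiry) in list(self.leases.items()):
            if expiry <= now:
                del self.leases[client]
                self.free.append(addr)

    def request(self, client, now):
        """Allocate or renew; returns (address, expiry, renew_at) or None if exhausted."""
        self.expire(now)
        if client in self.leases:
            addr = self.leases[client][0]      # renewal keeps the same address
        elif self.free:
            addr = self.free.pop(0)
        else:
            return None
        expiry = now + self.lease_seconds
        self.leases[client] = (addr, expiry)
        # Clients begin trying to renew after half of the lease duration.
        return addr, expiry, now + self.lease_seconds // 2

pool = LeasePool("192.0.2.100", "192.0.2.101", lease_seconds=3600)
print(pool.request("client-1", now=0))
```

With only two addresses in the pool, a third client is refused until an existing lease expires, which shows why long leases deplete a small pool more quickly.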

6.2.2 DHCP and BOOTP Message Format 

DHCP extends BOOTP, DHCP's predecessor. Compatibility is maintained between 
the protocols by defining the DHCP message format as an extension to BOOTP's 
in such a way that BOOTP clients can be served by DHCP servers, and BOOTP 
relay agents (see Section 6.2.6) can be used to support DHCP use, even on networks 
where DHCP servers do not reside. The message format includes a fixed-length 
initial portion and a variable-length tail portion (see Figure 6-1). 

The message format of Figure 6-1 is defined by BOOTP and DHCP in several
RFCs ([RFC0951][RFC1542][RFC2131]). The Op (Operation) field identifies the message
as either a request (1) or a reply (2). The HW Type (htype) field is assigned
based on values used with ARP (see Chapter 4) and defined in the corresponding
IANA ARP parameters page [IARP], with the value 1 (Ethernet) being very common.
The HW Len (hlen) field gives the number of bytes used to hold the hardware
(MAC) address and is commonly 6 for Ethernet-like networks. The Hops field is
used to store the number of relays through which the message has traveled. The
sender of the message sets this value to 0, and it is incremented at each relay.
The Transaction ID is a (random) number chosen by the client and copied into
responses by the server. It is used to match replies with requests.

The Secs field is set by the client with the number of seconds that have elapsed
since the first attempt to establish or renew an address. The Flags field currently
contains only a single defined bit called the broadcast flag. Clients may set this bit
in requests if they are unable or unwilling to process incoming unicast IP datagrams
but can process incoming broadcast datagrams (e.g., because they do not
yet have an IP address). Setting the bit informs the server and relays that broadcast
addressing should be used for replies.


   Op (Req/Reply)   HW Type           HW Len           Hops
   (8 bits)         (htype, 8 bits)   (hlen, 8 bits)   (8 bits)
   Transaction ID (xid, 32 bits)
   Secs (16 bits)                     Flags (16 bits)
   Client IP Address (ciaddr, 32 bits, if known)
   "Your" IP Address (yiaddr, 32 bits)
   (Next) Server IP Address (siaddr, 32 bits)
   Gateway (Relay) IP Address (giaddr, 32 bits)
   Client Hardware Address (chaddr, 128 bits)
   Server Name (sname, 64 bytes)        [option overload area: alternative area to hold options]
   Boot File Name (file, 128 bytes)     [option overload area: alternative area to hold options]
   Options (vend, variable)             [BOOTP option 53 indicates DHCP message]

   Flags (16 bits): Broadcast Flag; remaining bits Zero


Figure 6-1 The BOOTP message format, including field names from [RFC0951], [RFC1542], and [RFC2131].
The BOOTP message format is used to hold DHCP messages by appropriate assignment of options.
In this way, BOOTP relay agents can process DHCP messages, and BOOTP clients can use DHCP
servers. The Server Name and Boot File Name fields can be used to carry DHCP options if necessary.


Note 

There has been some difficulty in Windows environments regarding the use of
the broadcast flag. Windows XP and Windows 7 DHCP clients do not set the
flag, but Windows Vista clients do. Some DHCP servers in use do not process
the flag properly, leading to apparent difficulties in supporting Vista clients, even
though the Vista implementation is RFC-compliant. See [MKB928233] for more
information.


The next four fields are various IP addresses. The Client IP Address (ciaddr)
field includes a current IP address of the requestor, if known, and is 0 otherwise.
The "Your" IP Address (yiaddr) field is filled in by a server when providing an
address to a requesting client. The Next Server IP Address (siaddr) field gives the IP
address of the next server to use for the client's bootstrap process (e.g., if the client
needs to download an operating system image, which may be accomplished from a
server other than the DHCP server). The Gateway (or Relay) IP Address (giaddr) field
is filled in by a DHCP or BOOTP relay with its address when forwarding DHCP
(BOOTP) messages. The Client Hardware Address (chaddr) field holds a unique
identifier of the client and can be used in various ways by the server, including
arranging for the same IP address to be given each time a particular client makes
an address request. This field has traditionally held the client's MAC address,
which has been used as an identifier. Nowadays, the Client Identifier, an option
described in Sections 6.2.3 and 6.2.4, is preferred for this use.

The remaining fields include the Server Name (sname) and Boot File Name (file)
fields. These fields are not always filled in, but if they are, they contain 64 or 128
bytes, respectively, of ASCII characters indicating the name of the server or path to
the boot file. Such strings are null-terminated, as in the C programming language.
They can also be used instead to hold DHCP options if space is tight (see Section
6.2.3). The final field, originally known as the Vendor Extensions field in BOOTP
and fixed in length, is now known as the Options field and is variable in length. As
we shall see, options are used extensively with DHCP and are required to distinguish
DHCP messages from legacy BOOTP messages.

6.2.3 DHCP and BOOTP Options 

Given that DHCP extends BOOTP, any fields needed by DHCP that were not present
when BOOTP was designed are carried as options. Options take a standard
format beginning with an 8-bit tag indicating the option type. For some options,
a fixed number of bytes following the tag contain the option value. All others
consist of the tag followed by 1 byte containing the length of the option value (not
including the tag or length), followed by a variable number of bytes containing the
option value itself.
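
The tag/length/value layout just described can be walked with a short loop. The following Python sketch is illustrative only: the function name parse_options and the sample bytes are assumptions made for this example, and it treats Pad (0) and End (255) as the fixed-format options that carry no length byte.

```python
def parse_options(data: bytes) -> list[tuple[int, bytes]]:
    """Split a DHCP options area into (tag, value) pairs."""
    PAD, END = 0, 255
    options = []
    i = 0
    while i < len(data):
        tag = data[i]
        i += 1
        if tag == PAD:   # fixed format: tag only, no length byte
            continue
        if tag == END:   # fixed format: marks the end of the option list
            break
        if i >= len(data):
            raise ValueError(f"option {tag} is missing its length byte")
        length = data[i]                      # counts the value bytes only
        value = data[i + 1 : i + 1 + length]
        if len(value) != length:
            raise ValueError(f"option {tag} is truncated")
        options.append((tag, value))
        i += 1 + length
    return options


# Hypothetical sample: message type, subnet mask, one pad byte, end marker.
sample = bytes([53, 1, 2,                  # DHCP Message Type = 2
                1, 4, 255, 255, 255, 128,  # Subnet Mask = 255.255.255.128
                0,                         # Pad
                255])                      # End
print(parse_options(sample))
```

Anything after the End option is ignored, which is why padding bytes that follow it do no harm.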

A large number of options are available with DHCP, some of which are also
supported by BOOTP. The current list is given by the BOOTP/DHCP parameters
page [IBDP]. The first 77 options, including the most common ones, are specified
in [RFC2132]. Common options include Pad (0), Subnet Mask (1), Router
Address (3), Domain Name Server (6), Domain Name (15), Requested IP Address
(50), Address Lease Time (51), DHCP Message Type (53), Server Identifier (54),
Parameter Request List (55), DHCP Error Message (56), Lease Renewal Time (58),
Lease Rebinding Time (59), Client Identifier (61), Domain Search List (119), and
End (255).

The DHCP Message Type option (53) is a 1-byte-long option that is always used
with DHCP messages and has the following possible values: DHCPDISCOVER
(1), DHCPOFFER (2), DHCPREQUEST (3), DHCPDECLINE (4), DHCPACK (5),
DHCPNAK (6), DHCPRELEASE (7), DHCPINFORM (8), DHCPFORCERENEW
(9) [RFC3203], DHCPLEASEQUERY (10), DHCPLEASEUNASSIGNED (11),
DHCPLEASEUNKNOWN (12), and DHCPLEASEACTIVE (13). The last four values
are defined by [RFC4388].

Options may be carried in the Options field of a DHCP message, as well as in
the Server Name and Boot File Name fields mentioned previously. When options are
carried in either of these latter two places, called option overloading, a special
Overload option (52) is included to indicate which fields have been appropriated for
holding options. For options whose lengths exceed 255 bytes, a special long options
mechanism has been defined [RFC3396]. In essence, if the same option is repeated
multiple times in the same message, the contents are concatenated in the order in
which they appear in the message, and the result is processed as a single option. If
a long option also uses option overloading, the order of processing is last to first:
Options field, Boot File Name field, and then Server Name field.
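
A minimal sketch of how a receiver might combine overloading with the long options mechanism follows. The helper names (iter_tlv, merge_options) are hypothetical. The sketch assumes the Overload option values of [RFC2132] (1 means the Boot File Name field holds options, 2 means the Server Name field does, 3 means both), and it performs no validation of truncated input.

```python
PAD, END, OVERLOAD = 0, 255, 52


def iter_tlv(area: bytes):
    """Yield (tag, value) pairs from one option-carrying area."""
    i = 0
    while i < len(area):
        tag = area[i]
        if tag == END:
            return
        if tag == PAD:
            i += 1
            continue
        length = area[i + 1]
        yield tag, area[i + 2 : i + 2 + length]
        i += 2 + length


def merge_options(options_field: bytes, file_field: bytes = b"",
                  sname_field: bytes = b"") -> dict[int, bytes]:
    """Concatenate repeated options across the three possible areas."""
    merged: dict[int, bytes] = {}

    def absorb(area: bytes) -> None:
        for tag, value in iter_tlv(area):
            merged[tag] = merged.get(tag, b"") + value

    absorb(options_field)                    # Options field first
    overload = (merged.get(OVERLOAD) or b"\x00")[0]
    if overload & 1:
        absorb(file_field)                   # then Boot File Name
    if overload & 2:
        absorb(sname_field)                  # then Server Name
    return merged


# Hypothetical message: option 119 is split between the Options field and
# the Boot File Name field. The value bytes are placeholders, not a real
# domain search list encoding.
options_field = bytes([52, 1, 1, 119, 3]) + b"abc" + bytes([255])
file_field = (bytes([119, 2]) + b"de" + bytes([255])).ljust(128, b"\x00")
print(merge_options(options_field, file_field)[119])
```

Because the Overload value here is 1, the Server Name field is never consulted, even if it happens to contain bytes that look like options.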

Options tend to either provide relatively simple configuration information or
be used in supporting some other agreement protocol. For example, [RFC2132]
specifies options for most of the traditional configuration information a TCP/IP
node requires (addressing information, server addresses, Boolean assignments of
configuration information such as enabling IP forwarding, initial TTL values).
Subsequent specifications describe simple configuration information for NetWare
[RFC2241][RFC2242], user classes [RFC3004], FQDN [RFC4702], Internet Storage
Name Service server (iSNS, used in storage networks) [RFC4174], Broadcast and
Multicast Service controller (BCMCS, used with 3G cellular networks) [RFC4280],
time zone [RFC4833], autoconfiguration [RFC2563], subnet selection [RFC3011],
name service selection (see Chapter 11) [RFC2937], and servers for the Protocol
for Carrying Authentication for Network Access (PANA) (see Chapter 18) [RFC5192].
Those options defined for use in support of other protocols and functions are
described later, starting with Section 6.2.7.

6.2.4 DHCP Protocol Operation 

DHCP messages are essentially BOOTP messages with a special set of options.
When a new client attaches to a network, it first discovers what DHCP servers are
available and what addresses they are offering. It then decides which server to
use and which address it desires and requests it from the offering server (while
informing all the servers of its choice). Unless the server has given away the
address in the meantime, it responds by acknowledging the address allocation
to the requesting client. The time sequence of events between a typical client and
server is depicted in Figure 6-2.

Figure 6-2 A typical DHCP exchange. A client discovers a set of servers and addresses they are
offering using broadcast messages, requests the address it desires, and receives an
acknowledgment from the selected server. The transaction ID (xid) allows requests and
responses to be matched up, and the server ID (an option) indicates which server is
providing and committing the provided address binding with the client. If the client already
knows the address it desires, the protocol can be simplified to include use of only the
REQUEST and ACK messages.


Requesting clients set the BOOTP Op field to BOOTREQUEST and the first
4 bytes of the Options field to the decimal values 99, 130, 83, and 99, respectively
(the magic cookie value from [RFC2132]). Messages from client to server are sent as
UDP/IP datagrams containing a BOOTP BOOTREQUEST operation and an appropriate
DHCP message type (usually DHCPDISCOVER or DHCPREQUEST). Such
messages are sent from address 0.0.0.0 (port 68) to the limited broadcast address
255.255.255.255 (port 67). Messages traveling in the other direction (from server to
client) are sent from the IP address of the server and port 67 to the IP local broadcast
address and port 68 (see Chapter 10 for details on UDP).
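
To make the layout concrete, the following Python sketch assembles the bytes of a bare DHCPDISCOVER: the fixed BOOTP fields, the magic cookie, and a single DHCP Message Type option. It is a simplified illustration under stated assumptions, not a complete client: the function name build_discover is hypothetical, no padding or additional options are added, and nothing is actually transmitted.

```python
import struct

BOOTREQUEST = 1
HTYPE_ETHERNET = 1
MAGIC_COOKIE = bytes([99, 130, 83, 99])   # decimal values given in the text
DHCPDISCOVER = 1


def build_discover(mac: bytes, xid: int, broadcast: bool = True) -> bytes:
    """Return the UDP payload of a minimal DHCPDISCOVER (sketch only)."""
    flags = 0x8000 if broadcast else 0    # broadcast flag is the top bit
    zero_ip = bytes(4)
    fixed = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        BOOTREQUEST,        # Op
        HTYPE_ETHERNET,     # hardware type
        len(mac),           # hardware address length
        0,                  # hops
        xid,                # transaction ID
        0,                  # seconds elapsed
        flags,
        zero_ip,            # ciaddr
        zero_ip,            # yiaddr
        zero_ip,            # siaddr
        zero_ip,            # giaddr
        mac,                # chaddr, zero-padded to 16 bytes by struct
        b"",                # sname, 64 zero bytes
        b"",                # file, 128 zero bytes
    )
    options = bytes([53, 1, DHCPDISCOVER, 255])   # message type, then End
    return fixed + MAGIC_COOKIE + options


packet = build_discover(bytes.fromhex("00130220b918"), 0x3A681B0B)
print(len(packet), packet[236:240].hex())
```

Sending such a payload would additionally require a UDP socket using source port 68 with broadcast enabled, which on most systems needs elevated privileges.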

In a typical exchange, a client first broadcasts a DHCPDISCOVER message. Each
server receiving the request, either directly or through a relay, may respond with a
DHCPOFFER message, including an offered IP address in the "Your" IP Address
field. Other configuration options (e.g., IP address of DNS server, subnet mask) are
often included. The offer message includes the lease time (T), which provides the
upper bound on the amount of time the address can be used if it is not renewed. The
message also contains the renewal time (T1), which is the amount of time before the
client should attempt to renew its lease with the server from which it acquired its
lease, and the rebinding time (T2), which bounds the time in which it should attempt
to renew its address with any DHCP server. By default, T1 = (T/2) and T2 = (7T/8).
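
The default relationship between the three timers is simple arithmetic. As a quick check, this sketch (the function name default_timers is an illustrative assumption) computes T1 and T2 for a 12-hour lease:

```python
from datetime import timedelta


def default_timers(lease: timedelta) -> tuple[timedelta, timedelta]:
    """Return the default (T1, T2) for a lease of duration T."""
    t1 = lease / 2          # renewal time, T/2
    t2 = lease * 7 / 8      # rebinding time, 7T/8
    return t1, t2


t1, t2 = default_timers(timedelta(hours=12))
print(t1, t2)   # 6:00:00 10:30:00
```

A server is free to supply other values explicitly using the Lease Renewal Time (58) and Lease Rebinding Time (59) options.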

After receiving one or more DHCPOFFER messages from one or more servers,
the client determines which offer it will accept and broadcasts a DHCPREQUEST
message including the Server Identifier option. The Requested IP Address option is
set to the address received in the selected DHCPOFFER message. Multiple servers
may receive the broadcast DHCPREQUEST message, but only the server identified
within the DHCPREQUEST message acts by committing the address binding to
persistent storage; the others clear any state regarding the request. After handling
the binding, the selected server responds with a DHCPACK message, indicating to
the client that the address binding can now be used. In the case where the server
cannot allocate the address contained in the DHCPREQUEST message (e.g., it has
been allocated in some other way or is not available), the server responds with a
DHCPNAK message.

Once the client receives the DHCPACK message and other associated configuration
information, it may probe the network to ensure that the address provided
is not in use (e.g., by sending an ARP request for the address to perform ACD,
described in Chapter 4). Should the client determine that the address is already in
use, the client ceases using the address and sends a DHCPDECLINE message to
the server to indicate that the address cannot be used. After a recommended 10-second
delay, the client is able to retry. If a client elects to relinquish its address before its
lease expires, it sends a DHCPRELEASE message.

In circumstances where a client already has an IP address and wishes only
to renew its lease, the initial DHCPDISCOVER/DHCPOFFER messages can be
skipped. Instead, the protocol begins with the client requesting the address it
is currently using with a DHCPREQUEST message. At this point, the protocol
works as already described: the server will likely grant the request (with a
DHCPACK) or deny the request by issuing a DHCPNAK. Another circumstance arises
when a client already has an address, does not need to renew it, but requires other
(non-address) configuration information. In this case, it can use a DHCPINFORM
message in place of a DHCPREQUEST message to indicate its use of an existing
address and desire to obtain additional information. Such messages elicit a
DHCPACK message from the server, which includes the requested additional
configuration information.

6.2.4.1 Example

To see DHCP in action, we now inspect the packets exchanged when a Microsoft
Vista laptop attaches to a wireless LAN supported by a Linux-based DHCP server
(Windows 7 systems are nearly identical). The client was recently associated with a
different wireless network, using a different IP prefix, and is now being connected
to the new network. Because it remembers the address it had from the previous
network, the client first tries to continue using that address using a DHCPREQUEST
message (see Figure 6-3).


Note 

There is now an agreed-upon procedure for detecting network attachment (DNA),
specified in [RFC4436] for IPv4 and [RFC6059] for IPv6. These specifications do
not contain new protocols but instead suggest how unicast ARP (for IPv4) and
a combination of unicast and multicast Neighbor Solicitation/Router Discovery
messages (for IPv6; see Chapter 8) can be used to reduce the latency of acquiring
configuration information when a host switches network links. As these
specifications are relatively new (especially for IPv6), not all systems implement them.


No.  Time      Protocol  Source    Destination      Info
1    0.000000  DHCP      0.0.0.0   255.255.255.255  DHCP Request  - Transaction ID 0xdb23147d
2    0.018650  DHCP      10.0.0.1  255.255.255.255  DHCP NAK      - Transaction ID 0xdb23147d
3    1.083053  DHCP      0.0.0.0   255.255.255.255  DHCP Discover - Transaction ID 0x3a681b0b
4    4.084315  DHCP      10.0.0.1  255.255.255.255  DHCP Offer    - Transaction ID 0x3a681b0b
5    4.087406  DHCP      0.0.0.0   255.255.255.255  DHCP Request  - Transaction ID 0x3a681b0b
6    4.104592  DHCP      10.0.0.1  255.255.255.255  DHCP ACK      - Transaction ID 0x3a681b0b

Frame 1: 342 bytes on wire (2736 bits), 342 bytes captured (2736 bits)
Ethernet II, Src: 00:13:02:20:b9:18 (00:13:02:20:b9:18), Dst: ff:ff:ff:ff:ff:ff (ff:ff:ff:ff:ff:ff)
Internet Protocol, Src: 0.0.0.0 (0.0.0.0), Dst: 255.255.255.255 (255.255.255.255)
User Datagram Protocol, Src Port: 68 (68), Dst Port: 67 (67)
Bootstrap Protocol
    Message type: Boot Request (1)
    Hardware type: Ethernet
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0xdb23147d
    Seconds elapsed: 0
    Bootp flags: 0x8000 (Broadcast)
    Client IP address: 0.0.0.0 (0.0.0.0)
    Your (client) IP address: 0.0.0.0 (0.0.0.0)
    Next server IP address: 0.0.0.0 (0.0.0.0)
    Relay agent IP address: 0.0.0.0 (0.0.0.0)
    Client MAC address: 00:13:02:20:b9:18 (00:13:02:20:b9:18)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name not given
    Magic cookie: DHCP
    Option: (t=53,l=1) DHCP Message Type = DHCP Request
    Option: (t=61,l=7) Client identifier
    Option: (t=50,l=4) Requested IP Address = 172.16.1.34
    Option: (t=12,l=5) Host Name = "vista"
    Option: (t=81,l=8) Client Fully Qualified Domain Name
    Option: (t=60,l=8) Vendor class identifier = "MSFT 5.0"
    Option: (t=55,l=12) Parameter Request List
    End Option


Figure 6-3 A client has switched networks and attempts to request its old address, 172.16.1.34, from
a DHCP server on the new network using a DHCPREQUEST message.


In Figure 6-3 we can see a DHCP request sent in a link-layer broadcast frame
(destination ff:ff:ff:ff:ff:ff) using the unspecified source IP address 0.0.0.0 and the
limited broadcast destination address 255.255.255.255. Because the client does not
yet know if the address it is requesting will be successfully allocated and does
not know the network prefix used on the network to which it is attaching, it has
little alternative to using these addresses. The message is a UDP/IP datagram sent
from the BOOTP client port 68 (bootpc) to the server port 67 (bootps). As DHCP
is really part of BOOTP, the protocol is the Bootstrap Protocol and the message
type is a BOOTREQUEST (1), with hardware type set to 1 (Ethernet) and address
length of 6 bytes. The transaction ID is 0xdb23147d, a random number chosen by
the client. The BOOTP broadcast flag is set in this message, meaning responses
should be sent using broadcast addressing. The requested address of 172.16.1.34
is contained in one of several options. We shall have a closer look at the types of
options that appear in DHCP messages beginning in Section 6.2.9.

The nearby DHCP server receives the client's DHCPREQUEST message
including the requested IP address of 172.16.1.34. However, the server is unable to
allocate the address because 172.16.1.34 does not fall within the prefix in use on
the current network. Consequently, the server refuses the client's request by sending
a DHCPNAK message (see Figure 6-4).

Frame 2: 342 bytes on wire (2736 bits), 342 bytes captured (2736 bits)
Ethernet II, Src: 00:04:5a:9f:9e:80 (00:04:5a:9f:9e:80), Dst: ff:ff:ff:ff:ff:ff (ff:ff:ff:ff:ff:ff)
Internet Protocol, Src: 10.0.0.1 (10.0.0.1), Dst: 255.255.255.255 (255.255.255.255)
User Datagram Protocol, Src Port: 67 (67), Dst Port: 68 (68)
Bootstrap Protocol
    Message type: Boot Reply (2)
    Hardware type: Ethernet
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0xdb23147d
    Seconds elapsed: 0
    Bootp flags: 0x8000 (Broadcast)
    Client IP address: 0.0.0.0 (0.0.0.0)
    Your (client) IP address: 0.0.0.0 (0.0.0.0)
    Next server IP address: 0.0.0.0 (0.0.0.0)
    Relay agent IP address: 0.0.0.0 (0.0.0.0)
    Client MAC address: 00:13:02:20:b9:18 (00:13:02:20:b9:18)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name not given
    Magic cookie: DHCP
    Option: (t=53,l=1) DHCP Message Type = DHCP NAK
    Option: (t=54,l=4) DHCP Server Identifier = 10.0.0.1
    Option: (t=56,l=13) Message = "wrong address"
    End Option
    Padding


Figure 6-4 A DHCPNAK message is sent by the DHCP server, indicating that the client should not
attempt to use IP address 172.16.1.34. The transaction ID allows the client to know that
the message corresponds to its address request.


The DHCPNAK message shown in Figure 6-4 is sent as a broadcast BOOTP
reply from the server. It includes the message type of DHCPNAK, a transaction ID
matching the client's request, a Server Identifier option containing 10.0.0.1, a copy
of the client's identifier (MAC address in this case), and a textual string indicating
the form of error, "wrong address". At this point the client ceases trying to use
its old address of 172.16.1.34 and instead starts over, looking for whatever servers
and addresses it can find, using a DHCPDISCOVER message (see Figure 6-5).


Bootstrap Protocol
    Message type: Boot Request (1)
    Hardware type: Ethernet
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0x3a681b0b
    Seconds elapsed: 0
    Bootp flags: 0x8000 (Broadcast)
    Client IP address: 0.0.0.0 (0.0.0.0)
    Your (client) IP address: 0.0.0.0 (0.0.0.0)
    Next server IP address: 0.0.0.0 (0.0.0.0)
    Relay agent IP address: 0.0.0.0 (0.0.0.0)
    Client MAC address: 00:13:02:20:b9:18 (00:13:02:20:b9:18)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name not given
    Magic cookie: DHCP


Figure 6-5 The DHCPDISCOVER message indicates that the client is retrying its attempt to obtain
an address after the previous failure of its DHCPREQUEST message.


The DHCPDISCOVER message sent by the client and shown in Figure 6-5
is similar to the DHCPREQUEST message, including the requested IP address it
used before (it does not have any other address to request), but it contains a richer
list of options and a new transaction ID (0x3a681b0b). Most of the rest of the primary
BOOTP fields are left empty and set to 0, except the client MAC address,
which appears in the Client Hardware Address (chaddr) field. Note that this address
matches the Ethernet frame source MAC address, as expected, because the packet
was not forwarded through a BOOTP relay agent. The rest of the DISCOVER message
contains eight options, most of which are expanded in the screen shot in
Figure 6-6 so that the various option subtypes can be seen.

Figure 6-6 details the options included in the BOOTP request message. The first
option indicates that the message is a DHCPDISCOVER message. The second option
indicates a client's desire to know whether to use address autoconfiguration [RFC2563]
(described in Section 6.3). If it is unable to obtain an address using DHCP, it is
permitted to determine one itself if allowed to do so by the DHCP server.


    15 = Domain Name
    6 = Domain Name Server
    44 = NetBIOS over TCP/IP Name Server
    46 = NetBIOS over TCP/IP Node Type
    47 = NetBIOS over TCP/IP Scope
    31 = Perform Router Discover
    33 = Static Route
    121 = Classless Static Route
    249 = Private/Classless Static Route (Microsoft)
    43 = Vendor-Specific Information
    End Option
    Padding


Figure 6-6 The DHCPDISCOVER message may contain a rich list of parameter requests, indicating
what configuration information the client seeks.


The next option indicates that the Client Identifier (ID) option is set to
0100130220B918 (not shown). The DHCP server can use the client ID to determine
if there is any special configuration information to be given to the particular
requesting client. Most operating systems now allow the user to specify the client
ID for the DHCP client to use when obtaining an address. Generally, however, it is
better to allow the client ID to be chosen automatically, as the use of the same client
ID by multiple clients can lead to DHCP problems. The automatically selected
client ID is generally based on the MAC address of the client. In the case of Windows,
it is the MAC address with a 1-byte hardware type identifier prepended to
it (in this case, the value of the byte is 1, indicating Ethernet).
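
The construction of this identifier is easy to reproduce. In the sketch below, the function name windows_style_client_id is a hypothetical label for the scheme just described: a 1-byte hardware type followed by the MAC address.

```python
def windows_style_client_id(mac: str, hardware_type: int = 1) -> bytes:
    """Prepend a 1-byte hardware type (1 = Ethernet) to the MAC address."""
    return bytes([hardware_type]) + bytes.fromhex(mac.replace(":", ""))


client_id = windows_style_client_id("00:13:02:20:b9:18")
print(client_id.hex().upper())   # 0100130220B918
```

The result is 7 bytes long, which agrees with the length of 7 reported for the Client Identifier option (t=61,l=7) in the trace.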


Note 

There has been a move to use client identifiers that are not based on MAC
addresses. This is motivated by the desire to have a persistent identifier for a client
for use with IPv4 or IPv6 that remains consistent even if the system's network
interface hardware changes (which usually causes its MAC address to change).
[RFC4361] specifies node-specific identifiers for IPv4, using a scheme originally
defined for IPv6. It involves using a DHCP Unique Identifier (DUID) in combination
with an Identity Association Identifier (IAID) as specified for DHCPv6 [RFC3315]
(also see Sections 6.2.5.3 and 6.2.5.4), but with conventional DHCPv4. It also
deprecates the use of the Client Hardware Address (chaddr) field in DHCP messages.
However, it is not yet widely deployed.


The next (Requested IP Address) option indicates that the client is requesting 
IP address 172.16.1.34. This is the IP address it was using when associated with the 
previous wireless network. As mentioned before, this address is not available on 
the new network because a different network prefix is being used. 

Other options indicate a configured host name of "vista," a vendor class ID
of "MSFT 5.0" (for Microsoft Windows 2000 and later systems), and a parameter
request list. The Parameter Request List option provides an indication to the DHCP
server of what sort of configuration information the client is requesting. It consists
of a string of bytes in which each byte indicates a particular option number.
Here we can see that it includes conventional Internet information (subnet mask,
domain name, DNS server, default router) but also a number of other options common
to Microsoft systems (i.e., NetBIOS options). It also includes an indication
that the client is interested in knowing whether to perform ICMP Router Discovery
(see Chapter 8) and whether any static forwarding table entries should be
placed in the client's forwarding table when starting up (see Chapter 5).
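
Because each byte of the Parameter Request List is simply an option number, decoding it is a table lookup. The sketch below is illustrative: the name table covers only the options mentioned in this example, and the byte order of the sample value is an assumption, since the trace shows only that the list is 12 bytes long.

```python
OPTION_NAMES = {
    1: "Subnet Mask",
    3: "Router",
    6: "Domain Name Server",
    15: "Domain Name",
    31: "Perform Router Discover",
    33: "Static Route",
    43: "Vendor-Specific Information",
    44: "NetBIOS over TCP/IP Name Server",
    46: "NetBIOS over TCP/IP Node Type",
    47: "NetBIOS over TCP/IP Scope",
    121: "Classless Static Route",
    249: "Private/Classless Static Route (Microsoft)",
}


def describe_request_list(value: bytes) -> list[str]:
    """Translate each requested option number into a readable line."""
    return [f"{code} = {OPTION_NAMES.get(code, 'unknown')}" for code in value]


# Assumed ordering of the 12 requested options.
requested = bytes([1, 15, 3, 6, 44, 46, 47, 31, 33, 121, 249, 43])
for line in describe_request_list(requested):
    print(line)
```

A server is not obliged to return every requested option; the list is only a statement of what the client would like to receive.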


Note 

The reason there are three different types of static route parameters listed is 
a consequence of the history of addressing. Before the full adoption of subnet 
masks and network prefixes, the network portion of an address was known by 
inspection of the address alone (“classful addressing”), and this is the form of 
route used with the Static Route (33) parameter. With the adoption of classless 
routes, DHCP was updated to hold a mask that could be applied, resulting in the 
so-called Classless Static Route (CSR) parameter (121) defined in [RFC3442]. 
Microsoft’s variant (using code 249) is similar. 
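
To illustrate how a mask is carried alongside each route, the following sketch encodes one route in the compact form used by the Classless Static Route option, assuming the encoding of [RFC3442]: a 1-byte prefix length, only the significant bytes of the destination, and then the 4-byte router address. The function name and the example routes are hypothetical.

```python
import ipaddress


def encode_classless_route(destination: str, router: str) -> bytes:
    """Encode one destination/router pair for option 121 (sketch only)."""
    network = ipaddress.ip_network(destination)
    significant = (network.prefixlen + 7) // 8   # bytes needed for the prefix
    return (bytes([network.prefixlen])
            + network.network_address.packed[:significant]
            + ipaddress.ip_address(router).packed)


print(encode_classless_route("10.0.0.0/8", "10.0.0.1").hex())   # 080a0a000001
print(encode_classless_route("0.0.0.0/0", "10.0.0.1").hex())    # 000a000001
```

A default route (prefix length 0) needs no destination bytes at all, which is how a single option can carry several routes of different lengths back to back.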


The last parameter request (43) is for vendor-specific information. It is ordinarily
used in conjunction with the Vendor-Class Identifier option (60), to allow
clients to receive nonstandard information, although another proposal combines
the vendor's identity with the vendor-specific information [RFC3925], providing
a method to determine the vendor given any vendor-specific information, even
for a single client. In the case of Microsoft systems, vendor-specific information
is used for selecting the use of NetBIOS, indicating whether a DHCP lease should
be released on shutdown, and how the metric (preference) of a default route in the
forwarding table should be processed. It is also used by Microsoft's Network Access
Protection (NAP) system [MS-DHCPN]. Mac OS systems use vendor-specific information
in supporting Apple's NetBoot service and Boot Server Discovery Protocol
(BSDP) [F07].

Upon receipt of the DHCPDISCOVER message, a DHCP server responds with
an offer of an IP address, lease, and additional configuration information contained
in a DHCPOFFER message. In the example shown in Figure 6-7, there is
only one DHCP server (which is also a router and DNS server).


Bootstrap Protocol
    Message type: Boot Reply (2)
    Hardware type: Ethernet
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0x3a681b0b
    Seconds elapsed: 0
    Bootp flags: 0x8000 (Broadcast)
    Client IP address: 0.0.0.0 (0.0.0.0)
    Your (client) IP address: 10.0.0.57 (10.0.0.57)
    Option: (t=53,l=1) DHCP Message Type = DHCP Offer
    Option: (t=54,l=4) DHCP Server Identifier = 10.0.0.1
    Option: (t=51,l=4) IP Address Lease Time = 12 hours
    Option: (t=58,l=4) Renewal Time Value = 6 hours
    Option: (t=59,l=4) Rebinding Time Value = 10 hours, 30 minutes
    Option: (t=1,l=4) Subnet Mask = 255.255.255.128
    Option: (t=28,l=4) Broadcast Address = 10.0.0.127
    Option: (t=3,l=4) Router = 10.0.0.1
    Option: (t=6,l=4) Domain Name Server = 10.0.0.1
    Option: (t=15,l=4) Domain Name = "home"
    End Option
    Padding


Figure 6-7 The DHCPOFFER sent from the DHCP server at 10.0.0.1 is offering IP address 10.0.0.57
for up to 12 hours. Additional information includes the address of a DNS server, domain
name, default router IP address, subnet mask, and broadcast address. In this example,
the system with IP address 10.0.0.1 is the default router, DHCP server, and DNS server.


In the DHCPOFFER message shown in Figure 6-7 we again see that the message
format includes a BOOTP portion as well as a set of options that relate to its DHCP
address handling. The BOOTP message type is BOOTREPLY. The client IP address
provided by the server is 10.0.0.57, located in the "Your" [client] IP Address field. Note
that this address does not match the requested value of 172.16.1.34 contained in the
DHCPDISCOVER message, as the 172.16/12 prefix is not in use on the local network.

Additional information contained in the set of options includes the server's 
IP address (10.0.0.1), the lease time of the offered IP address (12 hours), and the T1 
(renewal) and T2 (rebinding) timeouts of 6 and 10.5 hours, respectively. In addition, 
the server provides the subnet mask for the client to use (255.255.255.128), the proper 
broadcast address (10.0.0.127), the default router and DNS server (all 10.0.0.1, the same 
as the DHCP server in this case), and a default domain name of "home". The domain 
name home is not standardized in any way and would not be used outside of a private 
network. This example is a home network, so by the author's convention the names 
of machines used on it have the form <name>.home. Once the client has collected a 
DHCPOFFER message and decided to attempt leasing the IP address 10.0.0.57 it has 
been offered, it continues with a second DHCPREQUEST message (see Figure 6-8). 
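
The values in the offer are mutually consistent, which can be confirmed with Python's standard ipaddress module. The function name offer_is_consistent is an illustrative assumption; the addresses are those from the offer in this example.

```python
import ipaddress


def offer_is_consistent(address: str, mask: str,
                        broadcast: str, router: str) -> bool:
    """Check that broadcast and router agree with the address and mask."""
    network = ipaddress.ip_interface(f"{address}/{mask}").network
    return (network.broadcast_address == ipaddress.ip_address(broadcast)
            and ipaddress.ip_address(router) in network)


network = ipaddress.ip_interface("10.0.0.57/255.255.255.128").network
print(network)                                           # 10.0.0.0/25
print(offer_is_consistent("10.0.0.57", "255.255.255.128",
                          "10.0.0.127", "10.0.0.1"))     # True
print(ipaddress.ip_address("172.16.1.34") in network)    # False
```

The last line shows why the client's earlier request was refused: its old address lies outside the prefix in use on the new network.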


Frame 5: 348 bytes on wire (2784 bits), 348 bytes captured (2784 bits)
Ethernet II, Src: 00:13:02:20:b9:18 (00:13:02:20:b9:18), Dst: ff:ff:ff:ff:ff:ff (ff:ff:ff:ff:ff:ff)
Internet Protocol, Src: 0.0.0.0 (0.0.0.0), Dst: 255.255.255.255 (255.255.255.255)
User Datagram Protocol, Src Port: 68 (68), Dst Port: 67 (67)
Bootstrap Protocol
    Message type: Boot Request (1)
    Hardware type: Ethernet
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0x3a681b0b
    Seconds elapsed: 0
    Bootp flags: 0x8000 (Broadcast)
    Client IP address: 0.0.0.0 (0.0.0.0)
    Your (client) IP address: 0.0.0.0 (0.0.0.0)
    Next server IP address: 0.0.0.0 (0.0.0.0)
    Relay agent IP address: 0.0.0.0 (0.0.0.0)
    Client MAC address: 00:13:02:20:b9:18 (00:13:02:20:b9:18)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name not given
    Magic cookie: DHCP
    Option: (t=53,l=1) DHCP Message Type = DHCP Request
    Option: (t=61,l=7) Client identifier
    Option: (t=50,l=4) Requested IP Address = 10.0.0.57
    Option: (t=54,l=4) DHCP Server Identifier = 10.0.0.1
    Option: (t=12,l=5) Host Name = "vista"
    Option: (t=81,l=8) Client Fully Qualified Domain Name
    Option: (t=60,l=8) Vendor class identifier = "MSFT 5.0"
    Option: (t=55,l=12) Parameter Request List
    End Option


Figure 6-8 The second DHCPREQUEST indicates that the client wishes to be assigned the IP address
10.0.0.57. The message is sent to the broadcast address and includes the address 10.0.0.1
in the Server ID option. This allows any other servers that may receive the broadcast to
know which DHCP server and address the client has selected.


The second DHCPREQUEST message, shown in Figure 6-8, is similar to the
DHCPDISCOVER message, except the requested IP address is now set to 10.0.0.57,
the DHCP message type is set to DHCPREQUEST, the DHCP autoconfiguration
option is not present, and the Server Identifier option is now filled in with the
address of the server (10.0.0.1). Note that this message, like the DHCPDISCOVER
message, is sent using broadcast so any server or client present on the local network
receives it. The Server Identifier option field is used to keep unselected
servers from committing the address binding. When the selected server receives
the DHCPREQUEST and commits the binding, it ordinarily responds with a
DHCPACK message, as we see in Figure 6-9.


[Wireshark capture: vista-dhcp.tr]

No.  Time      Protocol  Source    Destination      Info
1    0.000000  DHCP      0.0.0.0   255.255.255.255  DHCP Request  - Transaction ID 0xdb23147d
2    0.018650  DHCP      10.0.0.1  255.255.255.255  DHCP NAK      - Transaction ID 0xdb23147d
3    1.083053  DHCP      0.0.0.0   255.255.255.255  DHCP Discover - Transaction ID 0x3a681b0b
4    4.084315  DHCP      10.0.0.1  255.255.255.255  DHCP Offer    - Transaction ID 0x3a681b0b
5    4.087406  DHCP      0.0.0.0   255.255.255.255  DHCP Request  - Transaction ID 0x3a681b0b
6    4.104592  DHCP      10.0.0.1  255.255.255.255  DHCP ACK      - Transaction ID 0x3a681b0b

Frame 6: 355 bytes on wire (2840 bits), 355 bytes captured (2840 bits)
Ethernet II, Src: 00:04:5a:9f:9e:80 (00:04:5a:9f:9e:80), Dst: ff:ff:ff:ff:ff:ff (ff:ff:ff:ff:ff:ff)
Internet Protocol, Src: 10.0.0.1 (10.0.0.1), Dst: 255.255.255.255 (255.255.255.255)
User Datagram Protocol, Src Port: 67 (67), Dst Port: 68 (68)
Bootstrap Protocol
    Message type: Boot Reply (2)
    Hardware type: Ethernet
    Hardware address length: 6
    Hops: 0
    Transaction ID: 0x3a681b0b
    Seconds elapsed: 0
    Bootp flags: 0x8000 (Broadcast)
    Client IP address: 0.0.0.0 (0.0.0.0)
    Your (client) IP address: 10.0.0.57 (10.0.0.57)
    Next server IP address: 10.0.0.1 (10.0.0.1)
    Relay agent IP address: 0.0.0.0 (0.0.0.0)
    Client MAC address: 00:13:02:20:b9:18 (00:13:02:20:b9:18)
    Client hardware address padding: 00000000000000000000
    Server host name not given
    Boot file name not given
    Magic cookie: DHCP
    Option: (t=53,l=1) DHCP Message Type = DHCP ACK
    Option: (t=54,l=4) DHCP Server Identifier = 10.0.0.1
    Option: (t=51,l=4) IP Address Lease Time = 12 hours
    Option: (t=58,l=4) Renewal Time Value = 6 hours
    Option: (t=59,l=4) Rebinding Time Value = 10 hours, 30 minutes
    Option: (t=1,l=4) Subnet Mask = 255.255.255.128
    Option: (t=28,l=4) Broadcast Address = 10.0.0.127
    Option: (t=3,l=4) Router = 10.0.0.1
    Option: (t=6,l=4) Domain Name Server = 10.0.0.1
    Option: (t=15,l=4) Domain Name = "home"
    Option: (t=81,l=13) Client Fully Qualified Domain Name
    End Option


Figure 6-9 The DHCPACK message verifies to the client (and other servers) the allocation of address 
10.0.0.57 for up to 12 hours. 

The DHCPACK message shown in Figure 6-9 is very similar to the DHCPOFFER
message we have seen before. However, now the client's FQDN option is included 
as well. In this case (not shown), it is set to vista.home. At this point, the client 
is free to use the address 10.0.0.57, as far as the DHCP server is concerned. It is still 
advised to use techniques such as ACD, described in Chapter 4, to ensure that its 
address is not used by some other host. 
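
The timer values in Figure 6-9 (lease of 12 hours, renewal after 6 hours, rebinding after 10 hours 30 minutes) match the commonly used default fractions of the lease time: one-half for the renewal time (T1) and seven-eighths for the rebinding time (T2). The following minimal sketch assumes those default fractions; the function name is hypothetical, and a server is free to send different explicit values in options 58 and 59.

```python
from datetime import timedelta

# Default fractions of the lease time (an assumption in this sketch;
# a server may supply explicit values in options 58 and 59 instead).
T1_FRACTION = 0.5     # renewal time
T2_FRACTION = 0.875   # rebinding time


def default_timers(lease):
    """Return (T1, T2) derived from a lease duration given as a timedelta."""
    return lease * T1_FRACTION, lease * T2_FRACTION


t1, t2 = default_timers(timedelta(hours=12))
print("T1 =", t1, "T2 =", t2)
```

With a 12-hour lease this reproduces the renewal and rebinding times carried in the DHCPACK of Figure 6-9.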

The DHCP messages exchanged in this example are typical of a system when 
it boots or is attached to a new network. It is also possible to induce a system to 
perform the release or acquisition of DHCP configuration information by hand. 
For example, in Windows the following command will release the data acquired 
using DHCP: 


C:\> ipconfig /release 


and the following command will acquire it: 

C:\> ipconfig /renew 

In Linux, the following commands can be used to achieve the same results: 


Linux# dhclient -r 


to release a DHCP lease, and 


Linux# dhclient 


to renew one. 

The type of information acquired by DHCP and assigned to the local system 
can be ascertained with a variant of the ipconfig command on Windows. Here 
is an excerpt from its output: 

C:\> ipconfig /all 


Wireless LAN adapter Wireless Network Connection:

   Connection-specific DNS Suffix  . : home
   Description . . . . . . . . . . . : Intel(R) PRO/Wireless 3945ABG Network Connection
   Physical Address. . . . . . . . . : 00-13-02-20-B9-18
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   IPv4 Address. . . . . . . . . . . : 10.0.0.57(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.128
   Lease Obtained. . . . . . . . . . : Sunday, December 21, 2008 11:31:48 PM
   Lease Expires . . . . . . . . . . : Monday, December 22, 2008 11:31:40 AM
   Default Gateway . . . . . . . . . : 10.0.0.1
   DHCP Server . . . . . . . . . . . : 10.0.0.1
   DNS Servers . . . . . . . . . . . : 10.0.0.1
   NetBIOS over Tcpip. . . . . . . . : Enabled
   Connection-specific DNS Suffix Search List : home

This command is very useful to see what configuration information has been 
assigned to a host using DHCP or other means. 

6.2.4.2 The DHCP State Machine 

The DHCP protocol operates a state machine at the clients and servers. The states
dictate which types of messages the protocol is expecting to process next. The client
state machine is illustrated in Figure 6-10. Transitions between states (arrows)
occur because of messages that are received and sent or when timers expire.



Figure 6-10 The DHCP client state machine. The boldface states and transitions are typical for a 
client first acquiring a leased address. The dashed line and INIT state are where the 
protocol begins. 


As shown in Figure 6-10, a client begins in the INIT state when it has no information
and broadcasts the DHCPDISCOVER message. In the Selecting state, it collects
DHCPOFFER messages until it decides which address and server it wishes
to use. Once its selection has been made, it responds with a DHCPREQUEST message
and enters the Requesting state. At this point it may receive ACKs for other
addresses it does not want. If it finds no address it wants, it sends a DHCPDECLINE
and reverts to the INIT state. More likely, however, it receives a DHCPACK message
for an address it wants, accepts it, obtains the timeout values T1 and T2,
and enters the Bound state, where it is able to use the address until expiration.
Upon the first timer expiration (timer T1), the client enters the Renewing state and
attempts to reestablish its lease. This succeeds if a fresh DHCPACK is received
(returning the client to the Bound state). If not, T2 ultimately expires, causing the
client to attempt to reacquire an address from any server. If the lease time finally
expires, the client must give up the leased address and becomes disconnected if it
has no alternative address or network connection to use.
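
The transitions just described can be sketched as a small lookup table. This is only an illustration under stated assumptions: the event names are hypothetical labels invented here, the REBINDING state name (the state entered when T2 expires) is taken from the DHCPv4 specification rather than from the paragraph above, and less common transitions such as receiving a DHCPNAK are omitted.

```python
from enum import Enum, auto


class State(Enum):
    INIT = auto()
    SELECTING = auto()
    REQUESTING = auto()
    BOUND = auto()
    RENEWING = auto()
    REBINDING = auto()


# (current state, event) -> next state. The event names are hypothetical
# labels for the messages and timer expirations described in the text.
TRANSITIONS = {
    (State.INIT, "send_discover"): State.SELECTING,
    (State.SELECTING, "send_request"): State.REQUESTING,
    (State.REQUESTING, "recv_ack"): State.BOUND,
    (State.REQUESTING, "send_decline"): State.INIT,
    (State.BOUND, "t1_expired"): State.RENEWING,
    (State.RENEWING, "recv_ack"): State.BOUND,
    (State.RENEWING, "t2_expired"): State.REBINDING,
    (State.REBINDING, "recv_ack"): State.BOUND,
    (State.REBINDING, "lease_expired"): State.INIT,
}


def step(state, event):
    """Return the next state; events with no listed transition are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=State.INIT):
    """Apply a sequence of events, starting from INIT by default."""
    for event in events:
        state = step(state, event)
    return state


# The typical path for a client first acquiring a lease.
first_lease = ["send_discover", "send_request", "recv_ack"]
print(run(first_lease))
```

Keeping the transitions in a table rather than in nested conditionals makes it easy to compare the sketch against Figure 6-10 one arrow at a time.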

6.2.5 DHCPv6 

Although the IPv4 and IPv6 DHCP protocols achieve conceptually similar
goals, their respective protocol designs and deployment options differ. DHCPv6
[RFC3315] can be used in either a "stateful" mode, in which it works much like
DHCPv4, or in a "stateless" mode in conjunction with stateless address autoconfiguration
(see Section 6.3). In the stateless mode, IPv6 clients are assumed to self-configure
their IPv6 addresses but require additional information (e.g., DNS server
address) obtained using DHCPv6. Another option exists for deriving the location
of a DNS server using ICMPv6 Router Advertisement messages (see Chapters 8
and 11 and [RFC6106]).

6.2.5.1 IPv6 Address Lifecycle 

IPv6 hosts usually operate with multiple addresses per interface, and each address
has a set of timers indicating how long and for what purposes the corresponding
address can be used. In IPv6, addresses are assigned with a preferred lifetime and
valid lifetime. These lifetimes are used to form timeouts that move an address
from one state to another in an address's state machine (see Figure 6-11).


[Timeline intervals: DAD, Preferred, Valid]

Figure 6-11 The lifecycle of an IPv6 address. Tentative addresses are used only for DAD until
verified as unique. After that, they become preferred and can be used without restriction
until an associated timeout changes their state to deprecated. Deprecated addresses are
not to be used for initiating new connections and may not be used at all after the
associated valid timeout expires.

Figure 6-11 shows the lifecycle of an IPv6 address. An address is in the preferred
state when it is available for general use and is available as either a source
or destination IPv6 address. A preferred address becomes deprecated when its
preferred timeout occurs. When it becomes deprecated, it may still be used for
existing transport (e.g., TCP) connections but is not to be used for initiating new
connections.

When an address is first selected for use, it enters a tentative or optimistic state.
When in the tentative state, it may be used only for the IPv6 Neighbor Discovery
protocol (see Chapter 8). It is not used as a source or destination address for any
other purposes. While in this state the address is being checked for duplication,
to see if any other nodes on the same network are already using the address. The
procedure for doing this is called duplicate address detection (DAD) and is described
in more detail in Section 6.3.2.1. An alternative to conventional DAD is called optimistic
DAD [RFC4429], whereby a selected address is used for a limited set of
purposes until DAD completes. Because an optimistic use of an address is really
just a special set of rules for DAD, it is not a truly complete state itself. Optimistic
addresses are treated as deprecated for most purposes. In particular, an address
may be both optimistic and deprecated simultaneously, depending on the preferred
and valid lifetimes.
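
The lifecycle can be summarized as a function of the two lifetimes. The sketch below is a simplification under stated assumptions: the function names and the "invalid" label are hypothetical, time is measured in seconds from the moment the lifetimes were assigned, and optimistic DAD is not modeled. The sample lifetimes (130 s preferred, 200 s valid) are the ones that appear later in the capture of Figure 6-19.

```python
def address_state(age, dad_done, preferred_lifetime, valid_lifetime):
    """Classify an IPv6 address.

    age: seconds since the lifetimes were assigned (simplifying assumption).
    dad_done: True once duplicate address detection has completed.
    """
    if not dad_done:
        return "tentative"
    if age < preferred_lifetime:
        return "preferred"
    if age < valid_lifetime:
        return "deprecated"
    return "invalid"


def may_initiate_connection(state):
    """Only a preferred address may be used to initiate new connections."""
    return state == "preferred"


for age in (0, 100, 150, 250):
    state = address_state(age, dad_done=age > 0,
                          preferred_lifetime=130, valid_lifetime=200)
    print(age, state, may_initiate_connection(state))
```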

6.2.5.2 DHCPv6 Message Format 

DHCPv6 messages are encapsulated as UDP/IPv6 datagrams, with client port 546
and server port 547 (see Chapter 10). Messages are sent using a host's link-scoped
source address to either relay agents or servers. There are two message formats,
one used directly between a client and a server, and another when a relay is used
(see Figure 6-12).


Client/Server Message Format:

    Message Type | Transaction ID (24 bits)
    Options (variable)

Relay Agent Message Format:

    Message Type | Hop Count
    Link Address
    Peer Address
    Options (variable)

Figure 6-12 The basic DHCPv6 message format (left) and relay agent message format (right). Most interesting 
information in DHCPv6 is carried in options. 

The primary DHCPv6 message format is given in Figure 6-12 on the left and
an extended version, which includes the Link Address and Peer Address fields, is
given on the right. The format on the right is used between a DHCPv6 relay agent
and a DHCPv6 server. The Link Address field gives the global IPv6 address used
by the server to identify the link on which the client is located. The Peer Address
field contains the address of the relay agent or client from which the message to be
relayed was received. Note that relaying may be chained, so a relay may be relaying
a message received from another relay. Relaying, for DHCPv4 and DHCPv6, is
described in Section 6.2.6.

The message types for messages in the format on the left include typical DHCP-style
messages (REQUEST, REPLY, etc.), whereas the message types for messages
in the format on the right include RELAY-FORW and RELAY-REPL, to indicate a
message forwarded from a relay or destined to a relay, respectively. The Options
field for the format on the right always includes a Relay Message option, which
includes the complete message being forwarded by the relay. Other options may
also be included.
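
The two fixed headers can be told apart by the message type alone. The following sketch shows one way to split a DHCPv6 message into its header fields and its options. The function name and dictionary keys are hypothetical; the 1-byte Message Type and Hop Count widths follow [RFC3315], and the Link Address and Peer Address fields are 16-byte IPv6 addresses. The message type values 12 and 13 are those listed in Table 6-1.

```python
import ipaddress

RELAY_FORW = 12
RELAY_REPL = 13


def parse_dhcpv6_header(data):
    """Split a DHCPv6 message into its fixed header fields and raw options."""
    msg_type = data[0]
    if msg_type in (RELAY_FORW, RELAY_REPL):
        # Relay agent format: type (1), hop count (1),
        # link address (16), peer address (16), then options.
        return {
            "msg_type": msg_type,
            "hop_count": data[1],
            "link_address": ipaddress.IPv6Address(data[2:18]),
            "peer_address": ipaddress.IPv6Address(data[18:34]),
            "options": data[34:],
        }
    # Client/server format: type (1), transaction ID (3), then options.
    return {
        "msg_type": msg_type,
        "transaction_id": int.from_bytes(data[1:4], "big"),
        "options": data[4:],
    }


# A SOLICIT (type 1) header carrying the transaction ID from Figure 6-18.
solicit = bytes([1]) + bytes.fromhex("e3410c")
print(parse_dhcpv6_header(solicit))
```

The options that follow either header are parsed separately, because nearly all of the useful information in DHCPv6 is carried there.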

One of the differences between DHCPv4 and DHCPv6 is how DHCPv6 uses
IPv6 multicast addressing. Clients send requests to the All DHCP Relay Agents and
Servers multicast address (ff02::1:2). Source addresses are of link-local scope. In
IPv6, there is no legacy BOOTP message format. The message semantics, however,
are similar. Table 6-1 gives the types of DHCPv6 messages, their values, defining
RFCs, and the roughly equivalent message and defining RFC for DHCPv4.


Table 6-1 DHCPv6 message types, values, and defining standards. The approximately equivalent message
types for DHCPv4 are given to the right.

DHCPv6 Message        DHCPv6 Value  Reference   DHCPv4 Message                       Reference
SOLICIT               1             [RFC3315]   DISCOVER                             [RFC2132]
ADVERTISE             2             [RFC3315]   OFFER                                [RFC2132]
REQUEST               3             [RFC3315]   REQUEST                              [RFC2132]
CONFIRM               4             [RFC3315]   REQUEST                              [RFC2132]
RENEW                 5             [RFC3315]   REQUEST                              [RFC2132]
REBIND                6             [RFC3315]   DISCOVER                             [RFC2132]
REPLY                 7             [RFC3315]   ACK/NAK                              [RFC2132]
RELEASE               8             [RFC3315]   RELEASE                              [RFC2132]
DECLINE               9             [RFC3315]   DECLINE                              [RFC2132]
RECONFIGURE           10            [RFC3315]   FORCERENEW                           [RFC3203]
INFORMATION-REQUEST   11            [RFC3315]   INFORM                               [RFC2132]
RELAY-FORW            12            [RFC3315]   N/A
RELAY-REPL            13            [RFC3315]   N/A
LEASEQUERY            14            [RFC5007]   LEASEQUERY                           [RFC4388]
LEASEQUERY-REPLY      15            [RFC5007]   LEASE{UNASSIGNED, UNKNOWN, ACTIVE}   [RFC4388]
LEASEQUERY-DONE       16            [RFC5460]   LEASEQUERYDONE                       [ID4LQ]
LEASEQUERY-DATA       17            [RFC5460]   N/A                                  N/A
N/A                   N/A           N/A         BULKLEASEQUERY                       [ID4LQ]


In DHCPv6, most interesting information, including addresses, lease times,
location of services, and client and server identifiers, is carried in options. Two of
the more important concepts used with these options are called the Identity Association
(IA) and the DHCP Unique Identifier (DUID). We discuss them next.

6.2.5.3 Identity Association (IA)

An Identity Association (IA) is an identifier used between a DHCP client and server
to refer to a collection of addresses. Each IA comprises an IA identifier (IAID)
and associated configuration information. Each client interface that requests a
DHCPv6-assigned address requires at least one IA. Each IA can be associated with
only a single interface. The client chooses the IAID to uniquely identify each IA,
and this value is then shared with the server.

The configuration information associated with an IA includes one or more
addresses and associated lease information (T1, T2, and total lease duration values).
Each address in an IA has both a preferred and a valid lifetime [RFC4862],
which define the address's lifecycle. The types of addresses requested may be
regular addresses or temporary addresses [RFC4941]. Temporary addresses are
derived in part from random numbers to help improve privacy by frustrating the
tracking of IPv6 hosts based on IPv6 addresses. Temporary addresses are ordinarily
assigned at the same time nontemporary addresses are assigned but are regenerated
using a different random number more frequently.

When responding to a request, a server assigns one or more addresses to a
client's IA based on a set of address assignment policies determined by the server's
administrator. Generally, such policies depend on the link on which the request
arrived, standard information about the client (see DUID in Section 6.2.5.4), and
other information supplied by the client in DHCP options. The formats of the IA
option for nontemporary and temporary addresses are as shown in Figure 6-13.


IA for Nontemporary Address Option:

    OPTION_IA_NA (bits 0-15) | Option Length (bits 16-31)
    IAID (4 bytes)
    T1
    T2
    IA_NA Options (variable)

IA for Temporary Address Option:

    OPTION_IA_TA (bits 0-15) | Option Length (bits 16-31)
    IAID (4 bytes)
    IA_TA Options (variable)

Figure 6-13 The format for a DHCPv6 IA for nontemporary addresses (left) and temporary addresses (right).
Each option may include additional options describing particular IPv6 addresses and
corresponding leases.


The main difference between a nontemporary and a temporary address IA
option, as shown in Figure 6-13, is the inclusion of the T1 and T2 values in the
nontemporary case. These values are expected, as they are also the values used in
DHCPv4. For temporary addresses, the lack of T1 and T2 is made possible because
the lifetimes are generally determined based upon the T1 and T2 values assigned
to a nontemporary address that has been acquired previously. Details of temporary
addresses are given in [RFC4941].
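
As an illustration of the two layouts in Figure 6-13, the sketch below decodes the value portion of an IA option, including any nested IA Address options. It is a hedged example, not a complete parser: the function names are hypothetical, the option codes (3 for IA_NA, 4 for IA_TA, 5 for IA Address) and the 4-byte T1 and T2 fields are taken from [RFC3315], and the sample bytes are built from the field values decoded later in Figure 6-19 rather than copied from a capture.

```python
import ipaddress
import struct

OPTION_IA_NA = 3    # IA for nontemporary addresses
OPTION_IA_TA = 4    # IA for temporary addresses
OPTION_IAADDR = 5   # IA Address, nested inside an IA option


def split_options(data):
    """Yield (code, value) pairs: 2-byte code, 2-byte length, then the value."""
    offset = 0
    while offset + 4 <= len(data):
        code, length = struct.unpack_from("!HH", data, offset)
        yield code, data[offset + 4:offset + 4 + length]
        offset += 4 + length


def parse_ia(code, value):
    """Decode the value of an IA_NA or IA_TA option."""
    if code == OPTION_IA_NA:
        iaid, t1, t2 = struct.unpack_from("!III", value)
        ia, rest = {"iaid": iaid, "t1": t1, "t2": t2}, value[12:]
    elif code == OPTION_IA_TA:
        (iaid,) = struct.unpack_from("!I", value)
        ia, rest = {"iaid": iaid}, value[4:]   # no T1/T2 in the temporary case
    else:
        raise ValueError("not an IA option")
    ia["addresses"] = []
    for sub_code, sub_value in split_options(rest):
        if sub_code == OPTION_IAADDR:
            address = ipaddress.IPv6Address(sub_value[:16])
            preferred, valid = struct.unpack_from("!II", sub_value, 16)
            ia["addresses"].append((str(address), preferred, valid))
    return ia


# Sample IA_NA value assembled from the fields shown in Figure 6-19.
sample = (struct.pack("!III", 0x09001302, 60, 90)
          + struct.pack("!HH", OPTION_IAADDR, 24)
          + ipaddress.IPv6Address("2001:db8:0:f101::10fd").packed
          + struct.pack("!II", 130, 200))
print(parse_ia(OPTION_IA_NA, sample))
```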

6.2.5.4 DHCP Unique Identifier (DUID) 

A DHCP Unique Identifier (DUID) identifies a single DHCPv6 client or server and
is designed to be persistent over time. It is used by servers to identify clients for
the selection of addresses (as part of IAs) and configuration information, and
by clients to identify the server in which they are interested. DUIDs are variable
in length and are treated as opaque values by both clients and servers for most
purposes.

DUIDs are supposed to be globally unique yet easy to generate. To satisfy
these concerns simultaneously, [RFC3315] defines three different types of possible
DUIDs but also mentions that these are not the only three types that might ever be
created. The three types of DUIDs are as follows:

1. DUID-LLT: a DUID based on link-layer address plus time

2. DUID-EN: a DUID based on enterprise number and vendor assignment

3. DUID-LL: a DUID based on link-layer address only

The standard format for encoding a DUID begins with a 2-byte identifier indicating
which type of DUID is being expressed. The current list is maintained by
the IANA [ID6PARAM]. This is followed by a 16-bit hardware type derived from
[RFC0826] in the cases of DUID-LLT and DUID-LL, and a 32-bit Private Enterprise
Number in the case of DUID-EN.


Note 

A Private Enterprise Number (PEN) is a 32-bit value given out by the IANA to an
enterprise. It is usually used in conjunction with the SNMP protocol for network
management purposes. About 38,000 of them have been assigned as of mid-2011.
The current list is available from the IANA [IEPARAM].


The first form of DUID, DUID-LLT, is the recommended form. Following
the hardware type, it includes a 32-bit timestamp containing the number of seconds
since midnight (UTC), January 1, 2000 (mod 2^32). This rolls over (returns
to zero) in the year 2136. The last portion is a variable-length link-layer address.
The link-layer address can be selected from any of the host's interfaces, and the
same DUID should be used, once selected, for traffic on any interface. This form of
DUID is required to be stable even if the network interface from which the DUID
was derived is removed. Thus, it requires the host system to maintain stable storage.
The DUID-LL form is very similar but is recommended for systems lacking
stable storage (but having a stable link-layer address). The RFC says that a DUID-LL
must not be used by clients or servers that cannot determine if the link-layer
address they are using is associated with a removable interface.
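
To make the encoding concrete, the following sketch decodes the three DUID types described above. The function name and dictionary keys are hypothetical, and the type codes (1 for DUID-LLT, 2 for DUID-EN, 3 for DUID-LL) are taken from [RFC3315]. The sample value is the client identifier that appears in the captures later in this section.

```python
import struct
from datetime import datetime, timedelta, timezone

# DUID-LLT timestamps count seconds from midnight (UTC), January 1, 2000.
DUID_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)


def format_link_layer(raw):
    """Render a link-layer address as colon-separated hexadecimal bytes."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_duid(duid):
    """Decode a DUID given as bytes."""
    (duid_type,) = struct.unpack_from("!H", duid)
    if duid_type == 1:      # DUID-LLT: hardware type, time, link-layer address
        hw_type, seconds = struct.unpack_from("!HI", duid, 2)
        return {"type": "DUID-LLT", "hardware_type": hw_type,
                "time": DUID_EPOCH + timedelta(seconds=seconds),
                "link_layer": format_link_layer(duid[8:])}
    if duid_type == 2:      # DUID-EN: enterprise number, vendor-assigned identifier
        (pen,) = struct.unpack_from("!I", duid, 2)
        return {"type": "DUID-EN", "enterprise_number": pen,
                "identifier": duid[6:]}
    if duid_type == 3:      # DUID-LL: hardware type, link-layer address
        (hw_type,) = struct.unpack_from("!H", duid, 2)
        return {"type": "DUID-LL", "hardware_type": hw_type,
                "link_layer": format_link_layer(duid[4:])}
    return {"type": duid_type, "raw": duid[2:]}   # unknown type: keep opaque


client_duid = parse_duid(bytes.fromhex("000100010dd14b2e001422f4195f"))
print(client_duid)
```

The decoded time is expressed in UTC; it corresponds to the Pacific Daylight Time value that Wireshark displays for the same identifier in Figure 6-18.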

6.2.5.5 Protocol Operation 

The DHCPv6 protocol operates much like its DHCPv4 counterpart. Whether or
not a client initiates the use of DHCP is dependent on configuration options carried
in an ICMPv6 Router Advertisement message the host receives (see Chapter
8). Router advertisements include two important bit fields. The M field is the Managed
Address Configuration flag and indicates that IPv6 addresses can be obtained
using DHCPv6. The O field is the Other Configuration flag and indicates that information
other than IPv6 addresses is available using DHCPv6. Both fields, along
with several others, are specified in [RFC5175]. Any combination of the M and O
bit fields is possible, although having M on and O off is probably the least useful
combination. If both are off, DHCPv6 is not used, and address assignment takes
place using stateless address autoconfiguration, described in Section 6.3. Having
M off and O on indicates that clients should use stateless DHCPv6 and obtain their
addresses using stateless address autoconfiguration. The DHCPv6 protocol operates
using the messages defined in Table 6-1 and illustrated in Figure 6-14.
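
The M and O combinations described above reduce to a short decision function. This sketch assumes the usual bit positions in the Router Advertisement flags byte (M is the most significant bit, 0x80, and O is the next bit, 0x40); the function name and the returned strings are hypothetical labels.

```python
M_FLAG = 0x80   # Managed Address Configuration
O_FLAG = 0x40   # Other Configuration


def dhcpv6_mode(ra_flags):
    """Map the Router Advertisement flags byte to a configuration mode."""
    managed = bool(ra_flags & M_FLAG)
    other = bool(ra_flags & O_FLAG)
    if managed:
        # Addresses come from DHCPv6; other information is available too.
        return "stateful DHCPv6"
    if other:
        # Addresses are self-configured; DHCPv6 supplies other information.
        return "stateless DHCPv6"
    return "stateless address autoconfiguration only"


for flags in (0xC0, 0x80, 0x40, 0x00):
    print(hex(flags), dhcpv6_mode(flags))
```

The flags value 0xc0, with both bits set, is the one that appears in the Router Advertisement of Figure 6-17.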

Typically, a client starting out first determines what link-local address to use
and performs an ICMPv6 Router Discovery operation (see Chapter 8) to determine
if there is a router on the attached network. A router advertisement includes the M
and O bit fields mentioned previously. If DHCPv6 is in use, at least the M bit field
is set and the client multicasts (see Chapter 9) the DHCPv6 SOLICIT message to find
DHCPv6 servers. A response comes in the form of one or more ADVERTISE
messages, indicating the presence of at least one DHCPv6 server. These messages
constitute two of the so-called four-message exchange operations of DHCPv6.

Figure 6-14 Basic operation of DHCPv6. A client determines whether or not to use DHCPv6 from
information carried in ICMPv6 router advertisements. If used, DHCPv6 operations are
similar to those in DHCPv4 but differ significantly in the details.

In cases where the location of a DHCPv6 server is already known or an address
need not be allocated (e.g., stateless DHCPv6 or the Rapid Commit option is being
used; see Section 6.2.9), the four-message exchange can be shortened to become
a two-message exchange, in which case only the REQUEST and REPLY messages
are used. A DHCPv6 server commits a binding formed from the combination of
a DUID, IA type (temporary, nontemporary, or prefix; see Section 6.2.5.3), and
IAID. The IAID is a 32-bit number chosen by the client. Each binding can have
one or more leases, and one or more bindings can be manipulated using a single
DHCPv6 transaction.
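
One way to picture a binding is as a record keyed by the three values just listed. The sketch below is purely illustrative: the class, the in-memory dictionary, and the helper function are hypothetical and do not describe how any particular server stores its bindings. The sample DUID, IAID, and address are those used in the example that follows.

```python
from typing import NamedTuple


class BindingKey(NamedTuple):
    duid: bytes    # client DUID
    ia_type: str   # "nontemporary", "temporary", or "prefix"
    iaid: int      # 32-bit value chosen by the client


# Hypothetical in-memory store: each binding holds one or more leases.
bindings = {}


def add_lease(key, address, valid_lifetime):
    """Attach a lease (address and valid lifetime in seconds) to a binding."""
    bindings.setdefault(key, []).append((address, valid_lifetime))


key = BindingKey(bytes.fromhex("000100010dd14b2e001422f4195f"),
                 "nontemporary", 0x09001302)
add_lease(key, "2001:db8:0:f101::10fd", 200)
print(bindings[key])
```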

6.2.5.6 Extended Example 

Figure 6-15 shows an example of a Windows Vista (Service Pack 1) machine attaching
to a wireless network. Its IPv4 stack has been disabled. It begins by assigning
its link-local address and checking to see if that address is already being used.

DHCPv6  Request  XID: 0xe3410c  CID: 000100010dd14b2e001422f4195f  IAA: 2001:db8:0:f101::10fd
DHCPv6  Reply    XID: 0xe3410c  CID: 000100010dd14b2e001422f4195f  IAA: 2001:db8:0:f101::10fd

Frame 1: 78 bytes on wire (624 bits), 78 bytes captured (624 bits)
Ethernet II, Src: 00:13:02:20:b9:18 (00:13:02:20:b9:18), Dst: 33:33:ff:b7:40:5a (33:33:ff:b7:40:5a)
Internet Protocol Version 6, Src: :: (::), Dst: ff02::1:ffb7:405a (ff02::1:ffb7:405a)
Internet Control Message Protocol v6
    Type: 135 (Neighbor solicitation)
    Code: 0
    Checksum: 0xc449 [correct]
    Reserved: 0 (should always be zero)
    Target: fe80::fd26:de93:5ab7:405a (fe80::fd26:de93:5ab7:405a)

Figure 6-15 DAD for the client system's link-local address is a Neighbor Solicitation for its own IPv6 address.

In Figure 6-15 we see the ICMPv6 Neighbor Solicitation (DAD) for the client's
optimistic address fe80::fd26:de93:5ab7:405a. (DAD is described in more detail
when we discuss stateless address autoconfiguration in Section 6.3.2.1.) The packet
is sent to the corresponding solicited-node address ff02::1:ffb7:405a. It optimistically
assumes that this address is not otherwise in use on the link, so it continues
on immediately with a Router Solicitation (RS) (see Figure 6-16).
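
The destination addresses in Figure 6-15 can be derived from the target address. The sketch below assumes the standard mappings, which are not spelled out in this section: a solicited-node address is the prefix ff02::1:ff00:0/104 combined with the low-order 24 bits of the unicast address, and the Ethernet destination for an IPv6 multicast group is 33:33 followed by the low-order 32 bits of the group address. The function names are hypothetical.

```python
import ipaddress

SOLICITED_NODE_PREFIX = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node(unicast):
    """Return the solicited-node multicast address for a unicast address."""
    low24 = int(ipaddress.IPv6Address(unicast)) & 0xFFFFFF
    return ipaddress.IPv6Address(SOLICITED_NODE_PREFIX | low24)


def multicast_mac(group):
    """Return the Ethernet destination used for an IPv6 multicast group."""
    low32 = ipaddress.IPv6Address(str(group)).packed[-4:]
    return ":".join(f"{b:02x}" for b in b"\x33\x33" + low32)


group = solicited_node("fe80::fd26:de93:5ab7:405a")
print(group, multicast_mac(group))
```

Applied to the client's link-local address, this yields the IPv6 and Ethernet destination addresses seen in the Neighbor Solicitation of Figure 6-15.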

The RS shown in Figure 6-16 is sent to the All Routers multicast address ff02::2. 
It induces each router on the network to respond with a Router Advertisement 
(RA), which carries the important M and O bits the client requires to determine 
what to do next. 


Note 

This example shows a router solicitation being sent from an optimistic address 
including a source link-layer address option (SLLAO), in violation of [RFC4429]. 
The problem here is potential pollution of neighbor caches in any listening IPv6 
routers. They will process the option and establish a mapping in their neighbor 
caches between the tentative address and the link-layer address that may be a 
duplicate. However, this is very unlikely and is probably not of significant concern. 
Nonetheless, a pending “optimistic” option [IDDN], if standardized, will allow a 
router solicitation to include an SLLAO that avoids this issue. 


The RA in Figure 6-17 indicates the presence of a router, including its SLLAO
of 00:04:5a:9f:9e:80, which will be useful to the client for encapsulating subsequent
link-layer frames destined for the router. The Flags field indicates that the M and
O bit fields are both enabled (set to 1), so the client should proceed with DHCPv6,
both for obtaining its addresses and for obtaining other configuration information.
This is accomplished by soliciting a DHCPv6 server (see Figure 6-18).

The DHCPv6 SOLICIT message shown in Figure 6-18 includes a transaction
ID (as in DHCPv4), an elapsed time (0, not shown), and the DUID consisting of a
time and 6-byte MAC address. In this example, the MAC address 00:14:22:f4:19:5f
is the MAC address of the wired Ethernet interface on this client, which is not the
interface being used to send the SOLICIT message. Recall that for DUID-LL and
DUID-LLT types of DUIDs the link-layer information should be the same across
interfaces. The IA is for a nontemporary address, and the client has selected the
IAID 09001302. The time values are left at 0 in the request, meaning that the client
is not expressing a particular desire; they will be determined by the server.

The next option is the FQDN option specified by [RFC4704]. It is used to carry
the FQDN of the client but also to affect how DHCPv6 and DNS interact (see Section
6.4 on DHCP and DNS interaction). This option is used to enable dynamic
updates to FQDN-to-IPv6 address mapping by client or server. (The reverse is
generally handled by the server.) The first portion of this option contains three
bit fields: N (server should not perform update), O (client request overridden by
server), and S (server should perform update). The second portion of the option
contains a domain name, which may be fully qualified or not.

Frame 2: 70 bytes on wire (560 bits), 70 bytes captured (560 bits)
Ethernet II, Src: 00:13:02:20:b9:18 (00:13:02:20:b9:18), Dst: 33:33:00:00:00:02 (33:33:00:00:00:02)
Internet Protocol Version 6, Src: fe80::fd26:de93:5ab7:405a (fe80::fd26:de93:5ab7:405a), Dst: ff02::2 (ff02::2)
Internet Control Message Protocol v6
    Type: 133 (Router solicitation)
    Code: 0
    Checksum: 0x4a16 [correct]
    ICMPv6 Option (Source link-layer address)
        Type: Source link-layer address (1)
        Length: 8
        Link-layer address: 00:13:02:20:b9:18

Figure 6-16 The Router Solicitation induces a nearby router to provide a Router Advertisement. The
solicitation message is sent to the All Routers address (ff02::2).

3 0.001886 ICMPv6 fe80::204:5aff:fe9f:9e80 ff02::1 Router advertisement from 00:04:5a:9f:9e:80

Frame 3: 78 bytes on wire (624 bits), 78 bytes captured (624 bits)
Ethernet II, Src: 00:04:5a:9f:9e:80 (00:04:5a:9f:9e:80), Dst: 33:33:00:00:00:01 (33:33:00:00:00:01)
Internet Protocol Version 6, Src: fe80::204:5aff:fe9f:9e80 (fe80::204:5aff:fe9f:9e80), Dst: ff02::1 (ff02::1)
Internet Control Message Protocol v6
    Type: 134 (Router advertisement)
    Code: 0
    Checksum: 0x4017 [correct]
    Cur hop limit: 64
    Flags: 0xc0
        1... .... = Managed
        .1.. .... = Other
        ..0. .... = Not Home Agent
        ...0 0... = Router preference: Medium
        .... .0.. = Not Proxied
    Router lifetime: 1800
    Reachable time: 0
    Retrans timer: 0
    ICMPv6 Option (Source link-layer address)
        Type: Source link-layer address (1)
        Length: 8
        Link-layer address: 00:04:5a:9f:9e:80

Figure 6-17 A Router Advertisement indicates that addresses are managed (available by assignment using
DHCPv6) and that other information (e.g., DNS server) is also available using DHCPv6. This network
uses stateful DHCPv6. IPv6 Router Advertisement messages use ICMPv6 (see Chapter 8).


Note 

The Wireshark tool indicates that the FQDN name record in Figure 6-18 is malformed
and speculates that the packet may have been generated by a MS Vista
client, which indeed it was. The field is malformed because the original
specification for this option allowed a simple domain name encoding using
ASCII characters. This method has been deprecated by [RFC4704], and the two
encodings are not directly compatible. Microsoft provides a “hotfix” to address
this issue for Vista systems. Microsoft Windows 7 systems exhibit behavior compliant
with [RFC4704].


Other information in the solicitation message includes the identification of the
vendor class and requested option list. In this case, the vendor class data includes
the string "MSFT 5.0", which can be used by a DHCPv6 server to determine what
types of processing the client is capable of doing. In response to the client's solicitation,
the server responds with an ADVERTISE message (see Figure 6-19).

4 0.281512 DHCPv6 fe80::fd26:de93:5ab7:405a ff02::1:2 Solicit XID: 0xe3410c CID: 000100010dd14b2e001422f4195f

Frame 4: 146 bytes on wire (1168 bits), 146 bytes captured (1168 bits)
Ethernet II, Src: 00:13:02:20:b9:18 (00:13:02:20:b9:18), Dst: 33:33:00:01:00:02 (33:33:00:01:00:02)
Internet Protocol Version 6, Src: fe80::fd26:de93:5ab7:405a (fe80::fd26:de93:5ab7:405a), Dst: ff02::1:2 (ff02::1:2)
User Datagram Protocol, Src Port: 546 (546), Dst Port: 547 (547)
DHCPv6
    Message type: Solicit (1)
    Transaction ID: 0xe3410c
    Elapsed time
    Client Identifier: 000100010dd14b2e001422f4195f
        Option: Client Identifier (1)
        Length: 14
        Value: 000100010dd14b2e001422f4195f
        DUID type: link-layer address plus time (1)
        Hardware type: Ethernet (1)
        Time: May 6, 2007 19:27:58 Pacific Daylight Time
        Link-layer address: 00:14:22:f4:19:5f
    Identity Association for Non-temporary Address
        Option: Identity Association for Non-temporary Address (3)
        Length: 12
        Value: 090013020000000000000000
        IAID: 09001302
        T1: 0
        T2: 0
    Fully Qualified Domain Name
        Option: Fully Qualified Domain Name (39)
        Length: 6
        Value: 007669737461
        0000 0... = Reserved: 0x00
        .... .0.. = N bit: Server should perform DNS updates
        .... ..0. = O bit: Server has not overridden client's S bit preference
        .... ...0 = S bit: Server should not perform forward DNS updates
        Malformed DNS name record (MS Vista client?)
    Vendor Class
        Option: Vendor Class (16)
        Length: 14
        Value: 0000013700084d53465420352e30
        Enterprise ID: Microsoft (311)
        vendor-class-data: "MSFT 5.0"
    Option Request
        Option: Option Request (6)
        Length: 8
        Value: 0018001700110027
        Requested Option code: Domain Search List (24)
        Requested Option code: DNS recursive name server (23)
        Requested Option code: Vendor-specific Information (17)
        Requested Option code: Fully Qualified Domain Name (39)

Figure 6-18 The DHCPv6 SOLICIT message requests the location of one or more DHCPv6 servers and includes
information identifying the client and the options in which it is interested.


The ADVERTISE message shown in Eigure 6-19 provides a wealth of infor¬ 
mation to the client. The Client Identifier option echoes the client's configuration 
information. The Server Identifier option gives the time plus a link-layer address 
of 10:00:00:00:09:20 to identify the server. The lA has the value lAID 09001302 
(provided by the client) and includes the global address 2001:db8:0:fl01::10fd with 
preferred lifetime and valid lifetime of 130 and 200s, respectively (fairly short 
timeouts). The status code of 0 indicates success. Also provided with the DHCPv6 
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EB Frame 5: 182 bytes on wire (1456 bits), 182 bytes captured (1456 bits) 

Ethernet II, Src: 00:04:5a:9f:9e:80 (00:04:5a:9f:9e:80), Dst: 00:13:02:20:b9:18 (00:13:02:20:b9:18)
Internet Protocol Version 6, Src: fe80::204:5aff:fe9f:9e80 (fe80::204:5aff:fe9f:9e80), Dst: fe80::fd26:de93:5ab7:405a (fe80::fd26:de93:5ab7:405a)
User Datagram Protocol, Src Port: 547 (547), Dst Port: 546 (546)
DHCPv6
    Message type: Advertise (2)
    Transaction ID: 0xe3410c
    Client Identifier: 000100010dd14b2e001422f4195f
    Server Identifier: 000100010dd14b2e100000000920
        Option: Server Identifier (2)
        Length: 14
        Value: 000100010dd14b2e100000000920
        DUID type: link-layer address plus time (1)
        Hardware type: Ethernet (1)
        Time: May 6, 2007 19:27:58 Pacific Daylight Time
        Link-layer address: 10:00:00:00:09:20
    Identity Association for Non-temporary Address
        Option: Identity Association for Non-temporary Address (3)
        Length: 46
        Value: 090013020000003c0000005a0005001e20010db80000f101...
        IAID: 09001302
        T1: 60
        T2: 90
        IA Address: 2001:db8:0:f101::10fd
            Option: IA Address (5)
            Length: 30
            Value: 20010db80000f10100000000000010fd00000082000000c8...
            IPv6 address: 2001:db8:0:f101::10fd
            Preferred lifetime: 130
            Valid lifetime: 200
            Status code
                Option: Status code (13)
                Length: 2
                Value: 0000
                Status code: Success (0)
    DNS recursive name server
        Option: DNS recursive name server (23)
        Length: 16
        Value: 20010db80000f1010000000000000001
        DNS servers address: 2001:db8:0:f101::1
    Domain Search List
        Option: Domain Search List (24)
        Length: 6
        Value: 04686f6d6500
        DNS Domain Search List
            Domain: home

Figure 6-19 The DHCPv6 ADVERTISE message includes an address and lease, plus DNS server IPv6 address 
and domain search list. 

advertisement is the DNS Recursive Name Server option [RFC3646] indicating a
server address of 2001:db8:0:f101::1 and a Domain Search List option containing
the string home. Note that the server does not include an FQDN option, as it does
not implement that option.
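
The Server Identifier in Figure 6-19 can be decoded by hand, and a short sketch makes the layout concrete. The code below assumes the DUID is of type 1 (link-layer address plus time), whose time field counts seconds from midnight UTC on January 1, 2000. The function name parse_duid_llt is ours and purely illustrative; the hexadecimal strings are the values shown in the figure.

```python
from datetime import datetime, timedelta, timezone

# DUID-LLT time values count seconds from midnight UTC, January 1, 2000.
DUID_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)


def parse_duid_llt(hex_string):
    """Decode a type 1 DUID (link-layer address plus time)."""
    raw = bytes.fromhex(hex_string)
    duid_type = int.from_bytes(raw[0:2], "big")
    if duid_type != 1:
        raise ValueError("not a link-layer-address-plus-time DUID")
    hardware_type = int.from_bytes(raw[2:4], "big")   # 1 means Ethernet
    seconds = int.from_bytes(raw[4:8], "big")
    link_layer = ":".join(f"{octet:02x}" for octet in raw[8:])
    return {
        "hardware_type": hardware_type,
        "time": DUID_EPOCH + timedelta(seconds=seconds),
        "link_layer": link_layer,
    }


# Server and client identifiers as displayed in Figure 6-19.
server = parse_duid_llt("000100010dd14b2e100000000920")
client = parse_duid_llt("000100010dd14b2e001422f4195f")
print(server["time"].isoformat(), server["link_layer"])
print(client["time"].isoformat(), client["link_layer"])
```

The decoded server time is 02:27:58 UTC on May 7, 2007, which is the same instant the figure displays as May 6, 2007 19:27:58 Pacific Daylight Time.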

The next two packets are conventional Neighbor Solicitation and Neighbor
Advertisement messages between the client and the router, which we do not detail
further. That exchange is followed by the client's request for a commitment of the
global nontemporary address 2001:db8:0:f101::10fd (see Figure 6-20).

The REQUEST message shown in Figure 6-20 is very similar to the SOLICIT
message but includes the information carried in the ADVERTISE message from
the server (address, T1, and T2 values). The transaction ID remains the same for
all of the DHCPv6 messages we have seen. The exchange is completed with the
REPLY message, which is identical to the ADVERTISE message except for the
different message type and therefore is not detailed.












Frame 8: 192 bytes on wire (1536 bits), 192 bytes captured (1536 bits)
Ethernet II, Src: 00:13:02:20:b9:18 (00:13:02:20:b9:18), Dst: 33:33:00:01:00:02 (33:33:00:01:00:02)
Internet Protocol Version 6, Src: fe80::fd26:de93:5ab7:405a (fe80::fd26:de93:5ab7:405a), Dst: ff02::1:2 (ff02::1:2)
User Datagram Protocol, Src Port: 546 (546), Dst Port: 547 (547)
DHCPv6
    Message type: Request (3)
    Transaction ID: 0xe3410c
    Elapsed time
    Client Identifier: 000100010dd14b2e001422f4195f
    Server Identifier: 000100010dd14b2e100000000920
    Identity Association for Non-temporary Address
        Option: Identity Association for Non-temporary Address (3)
        Length: 40
        Value: 090013020000003c0000005a0005001820010db80000f101...
        IAID: 09001302
        T1: 60
        T2: 90
        IA Address: 2001:db8:0:f101::10fd
            Option: IA Address (5)
            Length: 24
            Value: 20010db80000f10100000000000010fd00000082000000c8
            IPv6 address: 2001:db8:0:f101::10fd
            Preferred lifetime: 130
            Valid lifetime: 200
    Fully Qualified Domain Name
    Vendor class
    Option Request


Figure 6-20 The DHCPv6 REQUEST message is similar to a SOLICIT message but includes information 
learned from the server's ADVERTISE message. 
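
To see how the IA option in the REQUEST message is laid out, the following sketch parses its 40-byte value. The byte string is assembled by us from the fields displayed in Figure 6-20 (the figure truncates the outer value, so the concatenation is our reconstruction), and the function name parse_ia_na is illustrative only.

```python
import ipaddress
import struct

IA_ADDRESS_OPTION = 5  # option code shown in the figure


def parse_ia_na(data):
    """Parse an IA for nontemporary addresses: IAID, T1, T2, then options."""
    iaid, t1, t2 = struct.unpack_from("!4sII", data, 0)
    addresses = []
    offset = 12
    while offset + 4 <= len(data):
        code, length = struct.unpack_from("!HH", data, offset)
        body = data[offset + 4 : offset + 4 + length]
        if code == IA_ADDRESS_OPTION:
            address = ipaddress.IPv6Address(body[:16])
            preferred, valid = struct.unpack_from("!II", body, 16)
            addresses.append((address, preferred, valid))
        offset += 4 + length
    return iaid.hex(), t1, t2, addresses


ia_na = bytes.fromhex(
    "09001302" "0000003c" "0000005a"     # IAID, T1 (60), T2 (90)
    "0005" "0018"                        # IA Address option, length 24
    "20010db80000f10100000000000010fd"   # 2001:db8:0:f101::10fd
    "00000082" "000000c8"                # preferred 130, valid 200
)
iaid, t1, t2, addresses = parse_ia_na(ia_na)
print(iaid, t1, t2, addresses)
print(int(iaid, 16))  # the IAID as a decimal number
```

Interpreted as a decimal number, the IAID 09001302 is 150999810.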


The DHCPv6 messages exchanged in this example are typical of a system
when it boots or is attached to a new network. As with DHCPv4, it is possible to
induce a system to perform the release or acquisition of this information by hand.
For example, in Windows the following command will release the data acquired
using DHCPv6:


C:\> ipconfig /release6


and the following command will acquire it:


C:\> ipconfig /renew6


The type of information acquired by DHCP and assigned to the local interface
can be ascertained with another variant of this command that we have seen
before. Here is an excerpt of its output:

C:\> ipconfig /all

Wireless LAN adapter Wireless Network Connection:

   Connection-specific DNS Suffix  . : home
   Description . . . . . . . . . . . : Intel(R) PRO/Wireless 3945ABG
                                       Network Connection
   Physical Address. . . . . . . . . : 00-13-02-20-B9-18
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   IPv6 Address. . . . . . . . . . . : 2001:db8:0:f101::12cd(Preferred)
   Lease Obtained. . . . . . . . . . : Sunday, December 21, 2008
                                       11:30:45 PM
   Lease Expires . . . . . . . . . . : Sunday, December 21, 2008
                                       11:37:04 PM
   Link-local IPv6 Address . . . . . : fe80::fd26:de93:5ab7:405a%9(Preferred)
   Default Gateway . . . . . . . . . : fe80::204:5aff:fe9f:9e80%9
   DHCPv6 IAID . . . . . . . . . . . : 150999810
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-0D-D1-4B-2E-00-14-22-F4-19-5F
   DNS Servers . . . . . . . . . . . : 2001:db8:0:f101::1
   NetBIOS over Tcpip. . . . . . . . : Disabled
   Connection-specific DNS Suffix Search List :
                                       home


Here we can see the link-layer address of the system (00:13:02:20:b9:18).
Note how this address was never used as a basis for forming the IPv6 addresses
in this example.
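
One way to check this observation is to compute what a link-local address derived from the link-layer address would look like and compare it with the addresses in the trace. The sketch below assumes the modified EUI-64 construction (invert the universal/local bit of the first byte and insert ff:fe in the middle); the function name eui64_link_local is ours.

```python
import ipaddress


def eui64_link_local(mac):
    """Form fe80::/64 plus a modified EUI-64 interface ID from a 48-bit MAC."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # invert the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)


router = eui64_link_local("00:04:5a:9f:9e:80")  # Ethernet source in Figure 6-19
client = eui64_link_local("00:13:02:20:b9:18")  # the client's physical address
print(router)
print(client)
```

The router's computed address, fe80::204:5aff:fe9f:9e80, matches the default gateway in the output above, so the router did form its address this way. The client's computed address, fe80::213:2ff:fe20:b918, differs from the link-local address it actually uses (fe80::fd26:de93:5ab7:405a).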

6.2.5.7 DHCPv6 Prefix Delegation (DHCPv6-PD and 6rd)

Although the discussion so far has revolved around configuring hosts, DHCPv6 
can also be used to configure routers. This works by having one router delegate a 
range of address space to another router. The range of addresses is described by 
an IPv6 address prefix. The prefix is carried in a DHCP Prefix option, defined by 
[RFC3633]. This is used in situations where the delegating router, which now acts 
as a DHCPv6 server as well, does not require detailed topology information about 
the network to which the prefix is being delegated. Such a situation can arise, for 
example, when an ISP gives out a range of IP addresses to be used and potentially 
reassigned by a customer. In such a circumstance, the ISP may choose to delegate 
a prefix to the customer's premises equipment using DHCPv6-PD. 

With prefix delegation, a new form of IA called an IA_PD is defined. Each
IA_PD consists of an IAID and associated configuration information and is
similar to an IA for addresses, as discussed previously. DHCPv6-PD is useful not
only for prefix delegation for fixed routers but is also suggested for use when
routers (and their attached subnets) can be mobile [RFC6276].

A special form of PD (6rd, described in [RFC5569]) has been created for
supporting IPv6 rapid deployment by service providers. The OPTION_6RD (212) option
[RFC5969] holds the IPv6 6rd prefix that is used in assigning IPv6 addresses at a
customer's site based on the customer's assigned IPv4 address. IPv6 addresses are
algorithmically assigned by taking the service provider's provisioned 6rd prefix as
the first n bits, with n being recommended as less than 32. A customer's assigned
unicast IPv4 address is then appended as the next 32 (or fewer) bits, resulting in an
IPv6 6rd delegated prefix that is handled identically to DHCPv6-PD and is
recommended to be 64 bits or shorter in length to allow automatic address
configuration (see Section 6.4) to operate without problems.

The OPTION_6RD option is variable in length and includes the following
values: the IPv4 mask length, 6rd prefix length, 6rd prefix, and a list of 6rd
relay addresses (IPv4 addresses of relays that provide 6rd). The IPv4 mask length
gives the number of high-order bits of the IPv4 address (counted from the left)
that are identical for every customer in the 6rd domain. These bits are omitted,
and the remaining bits are the ones used in assigning IPv6 addresses.
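
The arithmetic behind a 6rd delegated prefix can be sketched in a few lines. The code below follows our reading of [RFC5969], in which the IPv4 mask length counts the high-order IPv4 bits shared by all customers and therefore left out of the delegated prefix. The function name and the sample addresses are illustrative assumptions, not values from any particular deployment.

```python
import ipaddress


def sixrd_delegated_prefix(sixrd_prefix, customer_ipv4, ipv4_mask_len=0):
    """Append the low (32 - ipv4_mask_len) IPv4 bits to the 6rd prefix."""
    prefix = ipaddress.IPv6Network(sixrd_prefix)
    ipv4 = int(ipaddress.IPv4Address(customer_ipv4))
    used_bits = 32 - ipv4_mask_len
    ipv4_bits = ipv4 & ((1 << used_bits) - 1)   # drop the shared high-order bits
    new_length = prefix.prefixlen + used_bits
    if new_length > 64:
        raise ValueError("delegated prefix would be longer than 64 bits")
    value = int(prefix.network_address) | (ipv4_bits << (128 - new_length))
    return ipaddress.IPv6Network((value, new_length))


# All customers share 10.0.0.0/8, so only 24 IPv4 bits are embedded.
print(sixrd_delegated_prefix("2001:db8::/32", "10.100.100.1", 8))
# No shared bits: the whole IPv4 address is embedded.
print(sixrd_delegated_prefix("2001:db8::/32", "192.0.2.1", 0))
```

With a 32-bit 6rd prefix, embedding all 32 IPv4 bits yields a /64, which leaves the customer a single subnet; omitting shared bits leaves room for more.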

6.2.6 Using DHCP with Relays

In most simple networks, a single DHCP server is made available directly to
clients on the same LAN. However, in more complicated enterprises it may be
necessary or convenient to relay DHCP traffic through one or more DHCP relay agents,
as illustrated in Figure 6-21.



Figure 6-21 A DHCP relay agent extends the operation of DHCP beyond a single network segment. 

Information carried only between relays and DHCPv4 servers can be carried in the 
Relay Agent Information option. Relaying in DHCPv6 works in a similar fashion but 
with a different set of options. 


A relay agent is used to extend the operation of DHCP across multiple network
segments. In Figure 6-21 the relay between network segments A and B forwards
DHCP messages and may annotate the messages with additional information
using options or by filling in empty fields. Note that in ordinary circumstances,
a relay does not participate in all DHCP traffic exchanged between a client and
a server. Rather, it relays only those messages that are broadcast (or multicast in
IPv6). Such messages are usually exchanged when a client is obtaining its address
for the first time. Once a client has acquired an IP address and the server's IP
address using the Server Identifier option, it can carry out a unicast
conversation with the server that does not involve the relay. Note that relay agents
have traditionally been layer 3 devices and tend to incorporate routing
capabilities. After discussing the basics of layer 3 relays, we will look briefly at
alternatives that operate (mostly) at layer 2.

6.2.6.1 Relay Agent Information Option

In the original concept of a BOOTP or DHCP relay [RFC2131], a relay agent served
the purpose only of relaying a message from one subnet to another that would
otherwise not be passed on by a router. This allowed systems that could not
yet perform indirect delivery to acquire an address from a centralized location.
This is sensible for a network operating in an enterprise under one
administrative authority, but in cases where DHCP is used at a subscriber's premises
and the DHCP infrastructure is provided elsewhere (e.g., an ISP), more information
may be required. There are a number of possible reasons. For example, the ISP
may not trust the subscriber completely, or billing and logging may be associated
with other information not available in the basic DHCP protocol. It has therefore
become useful to include extra information in the messages that pass between the
relay and the server. The Relay Agent Information option (for DHCPv4,
abbreviated RAIO) [RFC3046] provides ways to include such information for IPv4
networks. IPv6 works somewhat differently, and we cover it in the following section.

The RAIO for DHCPv4 specified in [RFC3046] is really a meta-option, in 
the sense that it specifies a framework in which a number of suboptions can be 
defined. Many such suboptions have been defined, including several that are used 
by ISPs to identify from which user, circuit, or network a request is coming. In 
many cases we shall see that a suboption of the DHCPv4 information option has a 
corresponding IPv6 option. 
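
The meta-option structure is easy to see in code. The sketch below assumes the encoding from [RFC3046]: option code 82, a one-byte length, and a sequence of suboptions that each carry their own one-byte code and length. The suboption codes 1 (circuit ID) and 2 (remote ID) are from that RFC; the function names and the sample values are hypothetical.

```python
RAIO_CODE = 82          # Relay Agent Information option
SUBOPT_CIRCUIT_ID = 1   # Agent Circuit ID suboption
SUBOPT_REMOTE_ID = 2    # Agent Remote ID suboption


def encode_raio(suboptions):
    """Encode (code, value) pairs as suboptions nested inside option 82."""
    body = b""
    for code, value in suboptions:
        if len(value) > 255:
            raise ValueError("suboption value too long")
        body += bytes([code, len(value)]) + value
    if len(body) > 255:
        raise ValueError("option too long")
    return bytes([RAIO_CODE, len(body)]) + body


def decode_raio(option):
    """Return the list of (code, value) suboptions carried in option 82."""
    if len(option) < 2 or option[0] != RAIO_CODE or option[1] != len(option) - 2:
        raise ValueError("malformed Relay Agent Information option")
    suboptions = []
    offset = 2
    while offset < len(option):
        code, length = option[offset], option[offset + 1]
        suboptions.append((code, option[offset + 2 : offset + 2 + length]))
        offset += 2 + length
    return suboptions


# Hypothetical identifiers a relay might attach to a client's request.
wire = encode_raio([(SUBOPT_CIRCUIT_ID, b"eth0/1"), (SUBOPT_REMOTE_ID, b"modem42")])
print(wire.hex())
print(decode_raio(wire))
```

New suboptions can be added without changing the outer option, which is what makes the RAIO a framework rather than a single fixed-format option.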

Because some of the information conveyed between a relay and a server may
be important to secure, the DHCP Authentication suboption of the RAIO has been
defined in [RFC4030]. It provides a method to ensure data integrity of the
messages exchanged between relay and server. The approach is very similar to the
DHCP deferred authentication method (see Section 6.2.7), except the SHA-1
algorithm is used instead of the MD5 algorithm (see Chapter 18).

6.2.6.2 Relay Agent Remote-ID Suboption and IPv6 Remote-ID Option 

One common requirement placed upon a relay is to identify the client making a
DHCP request with information beyond what the client itself provides. A
suboption of the Relay Agent Information option, called the Remote-ID suboption,
provides a way to identify the requesting DHCP client using a number of
naming approaches that are locally interpreted (e.g., caller ID, user name, modem ID,
remote IP address of a point-to-point link). The DHCPv6 Relay Agent Remote-ID
option [RFC4649] provides the same capability but also includes an extra field,
the enterprise number, which indicates the vendor associated with the
identifying information. The format of the Remote-ID information is then specified in a
vendor-specific way based on the enterprise number. A common method is to use
a DUID for the remote ID.

6.2.6.3 Server Identifier Override 

In some cases a relay may wish to interpose itself for processing between a
DHCP client and server. This can be accomplished with a special Server Identifier
Override suboption [RFC5107]. The suboption is a variant of the RAIO mentioned
previously.

Ordinarily, a relay forwards SOLICIT messages and may append options to
these messages as they pass from client to server. Relays are necessary in this
circumstance because the client is likely to not yet have an acceptable IP address
and only sends its messages to the local subnet using broadcast or multicast
addressing. Once a client receives and selects its address, it can talk directly to
the DHCP server based upon the server's identity carried in the Server Identifier
option. In effect, this cuts the relay out of subsequent transactions between
client and server.

It is often useful to allow the relay to include a variety of options (e.g., RAIO
carrying a circuit ID) for other types of messages, such as REQUEST, in addition
to SOLICIT. This option includes a 4-byte value specifying the IPv4 address to use
in the Server Identifier option present in DHCPREPLY messages formed by
servers. The Server Identifier Override option is supposed to be used in conjunction
with the Relay Agent Flags suboption [RFC5010]. This suboption of the RAIO is a
set of flags that carry information from relay to server. So far, only one such flag
is defined: whether the destination address on the initial message from the client
used broadcast or unicast addressing. The server may make different address
allocation decisions based upon the setting of this flag.

6.2.6.4 Lease Query and Bulk Lease Query 

In some environments it is useful to allow a third-party system (such as a relay
or access concentrator) to learn the address bindings for a particular DHCP client.
This facility is provided by DHCP leasequery ([RFC4388][RFC6148] for DHCPv4
and [RFC5007] for DHCPv6). In the case of DHCPv6, it can also provide lease
information for delegated prefixes. In Figure 6-21, the relay agent may "glean"
information from DHCP packets that pass through it in order to influence what
information is provided to the DHCP server. Such information may be kept by the
relay but may be lost upon relay failure. The DHCPLEASEQUERY message allows
such an agent to reacquire this type of information on demand, usually when
relaying traffic for which it has lost a binding. The DHCPLEASEQUERY message
supports four types of queries for DHCPv4: IPv4 address, MAC address, Client
Identifier, and Remote ID. For DHCPv6, it supports two: IPv6 address and Client
Identifier (DUID).

DHCPv4 servers may respond to lease queries with one of the following
types of messages: DHCPLEASEUNASSIGNED, DHCPLEASEACTIVE, or
DHCPLEASEUNKNOWN. The first message indicates that the responding server
is authoritative for the queried value but no current associated lease is assigned.
The second form indicates that a lease is active, and the lease parameters
(including T1 and T2) are provided. There is no particular presumed use for this
information; it is made available to the requestor for whatever purposes it desires.
DHCPv6 servers respond with a LEASEQUERY-REPLY message that contains
a Client Data option. This option, in turn, includes a collection of the following
options: Client ID, IPv6 Address, IPv6 Prefix, and Client Last Transaction Time.
The last value is the time (in seconds) since the server last communicated with
the client in question. A LEASEQUERY-REPLY message may also contain the
following two options: Relay Data and Client Link. The first includes the data last
sent from a relay about the associated query, and the second indicates the link on
which the subject client has one or more address bindings. Once again, this
information is used for whatever purposes the requestor desires.

An extension to lease query called Bulk Leasequery (BL) [RFC5460][ID4LQ]
allows multiple bindings to be queried simultaneously, uses TCP/IP rather than
UDP/IP, and supports a wider range of query types. BL is designed as a special
service for obtaining binding information and is not really part of conventional
DHCP. Thus, clients wishing to obtain conventional configuration information do
not use BL. One particular use of BL is when DHCP is being used for prefix
delegation. In this case, it is common for a router to be acting as a DHCP-PD client.
It obtains a prefix and then provides an address from the address range represented
by the prefix as an assignment to conventional DHCP clients. However, if such a
router fails or reboots, it may lose the prefix information and have a difficult time
recovering because the conventional lease query mechanism requires an identifier
for the binding in order to form the query. BL helps this situation, and others, by
generalizing the set of possible query types.

BL provides several extensions to basic lease query. First, it uses TCP/IP (port
547 for IPv6 and port 67 for IPv4) instead of UDP/IP. This change allows for large
amounts of query information to be returned for a single query, as may be
necessary when retrieving a large number of delegated prefixes. BL also provides a
Relay Identifier option to allow queries to identify the querier more easily. A BL
query can then be based on relay identifier, link address (network segment), or
remote ID.

The Relay ID DHCPv6 option and Relay ID DHCPv4 suboption [ID4RI] may
include a DUID that identifies the relay agent. Relays can insert this option in
messages they forward, and the server can use it to associate bindings it receives
with the particular relay providing them. BL supports queries by address and DUID
specified in [RFC5007] and [RFC4388] but also queries by relay ID, link address,
and remote ID. These newer queries are supported only on TCP/IP-based servers
that support BL. Conversely, BL servers support only LEASEQUERY messages, not
the full set of ordinary DHCP messages.

BL extends the basic lease query mechanism with the LEASEQUERY-DATA
and LEASEQUERY-DONE messages. When responding successfully to a query, a
server first includes a LEASEQUERY-REPLY message. If additional information is
available, it includes a set of LEASEQUERY-DATA messages, one per binding, and
completes the set with a LEASEQUERY-DONE message. All messages pertaining
to the same group of bindings share a common transaction ID, the same one
provided in the initial LEASEQUERY-REQUEST message.
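
Because BL runs over TCP, a receiver has to recover message boundaries from a byte stream. The sketch below assumes the DHCPv6 BL framing we understand [RFC5460] to use: each message is preceded by a 16-bit length in network byte order and begins with a 1-byte message type and a 3-byte transaction ID. The message type numbers are our recollection of the assigned values and should be checked against the RFCs; the helper names are ours.

```python
import struct

LEASEQUERY_REPLY = 15   # assumed type numbers; verify against the RFCs
LEASEQUERY_DONE = 16
LEASEQUERY_DATA = 17


def frame(msg_type, transaction_id, options=b""):
    """Prefix one DHCPv6 message with its 16-bit size for sending over TCP."""
    message = bytes([msg_type]) + transaction_id.to_bytes(3, "big") + options
    return struct.pack("!H", len(message)) + message


def split_bulk_stream(stream):
    """Split received bytes into (type, transaction ID, options) tuples."""
    messages = []
    offset = 0
    while offset + 2 <= len(stream):
        (size,) = struct.unpack_from("!H", stream, offset)
        message = stream[offset + 2 : offset + 2 + size]
        if size < 4 or len(message) < size:
            break  # malformed or incomplete; wait for more data
        transaction_id = int.from_bytes(message[1:4], "big")
        messages.append((message[0], transaction_id, message[4:]))
        offset += 2 + size
    return messages


# A reply, two per-binding data messages, and the terminating done message.
stream = (frame(LEASEQUERY_REPLY, 0xE3410C)
          + frame(LEASEQUERY_DATA, 0xE3410C, b"\x00")
          + frame(LEASEQUERY_DATA, 0xE3410C, b"\x01")
          + frame(LEASEQUERY_DONE, 0xE3410C))
messages = split_bulk_stream(stream)
finished = bool(messages) and messages[-1][0] == LEASEQUERY_DONE
print([m[0] for m in messages], finished)
```

A real client would also match the transaction ID of each message against its outstanding query before accepting the bindings.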

6.2.6.5 Layer 2 Relay Agents 

In some network environments, there are layer 2 devices (e.g., switches, bridges)
that are located near end systems that relay and process DHCP requests. These
layer 2 devices do not have a full TCP/IP implementation stack and are not
addressable using IP. As a result, they cannot act as conventional relay agents. To
deal with this issue, [IDL2RA] and [RFC6221] specify how layer 2 "lightweight" DHCP
relay agents (LDRAs) should behave, for IPv4 and IPv6, respectively. When referring
to relay behaviors, interfaces are labeled as client-facing or network-facing, and as
either trusted or untrusted. Network-facing interfaces are topologically closer to
DHCP servers, and trusted interfaces are those where it is assumed that arriving
packets are not spoofed.

The primary issue for IPv4 LDRAs is how to handle the DHCP giaddr field and
insert a RAIO when the LDRA itself has no IP layer information. The approach
recommended by [IDL2RA] is to have LDRAs insert the RAIO into DHCP requests
received from clients but not fill in the giaddr field. The resulting DHCP message
is sent in a broadcast fashion to one or more DHCP servers, as well as any other
receiving LDRAs. Such messages are flooded (i.e., sent on all interfaces except
the one upon which the message was received) unless received on an untrusted
interface. LDRAs receiving such a message already including a RAIO do not add
another such option but perform flooding. Responses (e.g., DHCPOFFER
messages) sent using broadcast may be intercepted by the LDRA, which in turn strips
the RAIO and uses its information to forward the response to the original
requesting client. Many LDRAs also intercept unicast DHCP traffic. In these cases, the
RAIO is also created or stripped as necessary. Note that compatible DHCP
servers must support the ability to process and return DHCP messages containing
RAIOs without a valid giaddr field, whether such messages are sent using unicast
or broadcast.

IPv6 LDRAs process DHCPv6 traffic by creating RELAY-FORW and RELAY-REPL
messages. ADVERTISE, REPLY, RECONFIGURE, and RELAY-REPL messages
received on client-facing interfaces are discarded. In addition, RELAY-FORW
messages received on untrusted client-facing interfaces are also discarded as a
security precaution. RELAY-FORW messages are built containing options that
identify the client-facing interface (i.e., Link-Address field, Peer-Address field, and
Interface-ID option). The Link-Address field is set to 0, the Peer-Address field is set
to the client's IP address, and the Interface-ID option is set to a value configured
in the LDRA. When receiving a RELAY-REPL message containing a Link-Address
field with value 0, the LDRA decapsulates the included message and sends it
toward the client on the interface specified in the received Interface-ID option
(provided by the server). RELAY-FORW messages received on client-facing
interfaces are modified by incrementing the hop count. Messages other than
RELAY-REPL messages received on network-facing interfaces are dropped.
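
The encapsulation an IPv6 LDRA performs for a message arriving directly from a client can be sketched as follows. The code assumes the DHCPv6 relay message layout (message type, hop count, 16-byte Link-Address, 16-byte Peer-Address, then options) and the option codes for Interface-ID (18) and Relay Message (9); the function names and the interface identifier "port-7" are hypothetical. It covers only the simple case in which the received message is not already a RELAY-FORW, so the hop count is 0.

```python
import ipaddress
import struct

RELAY_FORW = 12
OPTION_RELAY_MSG = 9
OPTION_INTERFACE_ID = 18


def dhcpv6_option(code, value):
    """Encode one DHCPv6 option: 16-bit code, 16-bit length, value."""
    return struct.pack("!HH", code, len(value)) + value


def build_ldra_relay_forw(client_message, client_address, interface_id):
    """Wrap a client's DHCPv6 message the way a layer 2 relay would."""
    header = (bytes([RELAY_FORW, 0])                 # type, hop count 0
              + bytes(16)                            # Link-Address set to 0
              + ipaddress.IPv6Address(client_address).packed)  # Peer-Address
    return (header
            + dhcpv6_option(OPTION_INTERFACE_ID, interface_id)
            + dhcpv6_option(OPTION_RELAY_MSG, client_message))


# A bare SOLICIT (type 1) carrying the transaction ID from the earlier trace.
solicit = bytes([1]) + bytes.fromhex("e3410c")
relayed = build_ldra_relay_forw(solicit, "fe80::fd26:de93:5ab7:405a", b"port-7")
print(relayed.hex())
```

The server echoes the Interface-ID option in its RELAY-REPL, which is how the LDRA learns which client-facing interface to use without keeping per-client state.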

6.2.7 DHCP Authentication 

While we ordinarily discuss various security vulnerabilities at the end of each
chapter (as we do in this one), for DHCP it is worth mentioning them here. It
should be apparent that if the smooth operation of DHCP is interfered with, hosts
are likely to be configured with erroneous information and significant disruption
could result. Unfortunately, as we have discussed so far, DHCP has no provision
for security, so it is possible for unauthorized DHCP clients or servers to be set
up, either intentionally or accidentally, that could cause havoc with an otherwise
functioning network.

In an attempt to mitigate these problems, a method to authenticate DHCP
messages is specified in [RFC3118]. It defines a DHCP option, the Authentication
option, with the format shown in Figure 6-22.


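 0                            15 16                           31
+---------------+---------------+---------------+---------------+
|     Code      |    Length     |   Protocol    |   Algorithm   |
+---------------+---------------+---------------+---------------+
|      RDM      |                                               |
+---------------+   Replay Detection                            +
|                   (64 bits, based on RDM)                     |
+               +-----------------------------------------------+
|               |                                               |
+---------------+   Authentication Information                  +
|                   (variable, based on Protocol)               |
+---------------------------------------------------------------+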


Figure 6-22 The DHCP Authentication option includes replay detection and can use various
methods for authentication. Specified back in 2001, this option is not widely used today.


The purpose of the Authentication option is to help determine whether a 
DHCP message has come from an authorized sender. The Code field is set to 90, 
and the Length field gives the number of bytes in the option (not including the 
Code or Length fields). If the Protocol and Algorithm fields have the value 0, the 
Authentication Information field holds a simple shared configuration token. As long as 
the configuration token matches at the client and server, the message is accepted. 
This could be used, for example, to hold a password or similar text string, but such 
traffic could be intercepted by an attacker, so this method is not very secure. It 
might help to fend off accidental DHCP problems, however. 

A somewhat more secure method involves so-called deferred authentication,
indicated if the Protocol and Algorithm fields are set to 1. In this case, the client's
DHCPDISCOVER or DHCPINFORM message includes an Authentication option,
and the server responds with authentication information included in its
DHCPOFFER or DHCPACK message. The authentication information includes a message
authentication code (MAC; see Chapter 18), which provides authentication of the
sender and an integrity check on the message contents. Assuming that the server
and client have a shared secret, the MAC can be used to ensure that the client is
trusted by the server and vice versa. It can also be used to ensure that the DHCP
messages exchanged between them have not been modified or replayed from an
earlier DHCP exchange. The replay detection method (RDM) is determined by
the value of the RDM field. For RDM set to 0, the Replay Detection field contains a
monotonically increasing value (e.g., timestamp). Received messages are checked
to ensure that this value always increases. If the value does not increase, it is
likely that an earlier DHCP message is simply being replayed (captured, stored,
and played back later). It is conceivable that the value in the Replay Detection field
could fail to advance in a situation where packets are reordered, but this is highly
unlikely in a LAN (where DHCP is most prevalent) because only a single routing
path is ordinarily used between the DHCP client and server.
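
The fixed part of the Authentication option and the RDM 0 check can be sketched as follows. The field sizes are those of Figure 6-22 (one byte each for Code, Length, Protocol, Algorithm, and RDM, followed by a 64-bit Replay Detection value); the names parse_auth_option and ReplayGuard are ours, and the sketch deliberately omits verification of the MAC itself.

```python
import struct

AUTH_OPTION_CODE = 90


def parse_auth_option(option):
    """Split a DHCP Authentication option into its fields."""
    code, length, protocol, algorithm, rdm = struct.unpack_from("!BBBBB", option, 0)
    if code != AUTH_OPTION_CODE or length != len(option) - 2 or length < 11:
        raise ValueError("malformed Authentication option")
    (replay,) = struct.unpack_from("!Q", option, 5)
    return {"protocol": protocol, "algorithm": algorithm, "rdm": rdm,
            "replay": replay, "info": option[13:]}


class ReplayGuard:
    """Accept a message only if its replay value exceeds the last one seen."""

    def __init__(self):
        self.last = None

    def accept(self, replay_value):
        if self.last is not None and replay_value <= self.last:
            return False  # not increasing: treat as a possible replay
        self.last = replay_value
        return True


# A made-up option: protocol 1, algorithm 1, RDM 0, counter 1000, 20 bytes of
# placeholder authentication information.
sample = bytes([90, 31, 1, 1, 0]) + struct.pack("!Q", 1000) + bytes(20)
fields = parse_auth_option(sample)
guard = ReplayGuard()
print(fields["replay"], guard.accept(fields["replay"]), guard.accept(fields["replay"]))
```

The second call to accept returns False because the same counter value is presented twice, which is exactly the situation a replayed message creates.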

There are (at least) two reasons why DHCP authentication has not seen
widespread use. First, the approach requires shared keys to be distributed between a
DHCP server and each client requiring authentication. Second, the Authentication
option was specified after DHCP was already in relatively widespread use.
Nonetheless, [RFC4030] builds upon this specification to help secure DHCP messages
passed through relay agents (see Section 6.2.6).

6.2.8 Reconfigure Extension 

In ordinary operation, a DHCP client initiates the renewal of address bindings.
[RFC3203] defines the reconfigure extension and associated DHCPFORCERENEW
message. This extension allows a server to cause a single client to change to the
Renewing state and attempt to renew its lease by an otherwise ordinary
operation (i.e., DHCPREQUEST). A server that does not wish to renew the lease for the
requested address may respond with a DHCPNAK, causing the client to restart
in the INIT state. The client would then begin again using a DHCPDISCOVER
message.

The purpose of this extension is to cause the client to reestablish an address or
to cause it to lose its address as the result of some significant change of state within
the network. This could happen, for example, if the network is being
administratively taken down or renumbered. Because this message is such an obvious
candidate for a DoS attack, it must be authenticated using DHCP authentication.
Because DHCP authentication is not in widespread use, neither is the reconfigure
extension.

6.2.9 Rapid Commit 

The DHCP Rapid Commit option [RFC4039] allows a DHCP server to respond
to the DHCPDISCOVER message with a DHCPACK, effectively skipping the
DHCPREQUEST message and ultimately using a two-message exchange instead
of a four-message exchange. The motivation for this option is to quickly configure
hosts that may change their point of network attachment frequently (i.e., mobile
hosts). When only a single DHCP server is available and addresses are plentiful,
this option should be of no significant concern.

To use rapid commit, a client includes the option in a DHCPDISCOVER
message; it is not permitted to include it in any other message. Similarly, a server
uses this option only in DHCPACK messages. When a server responds with this
option, the receiving client knows that the returned address may be used immediately.
If it should determine later that the address is already in use by another system
(e.g., via ARP), the client sends a DHCPDECLINE message and abandons the address.
It may also voluntarily relinquish the address it has received using a DHCPRELEASE
message.
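
The server's side of this decision is simple enough to sketch. The code below is a toy model, not an implementation: messages are represented by their names and options by a dictionary keyed on option code, and we assume option code 80 for Rapid Commit as assigned by [RFC4039]. The function name and the server_allows flag are hypothetical.

```python
RAPID_COMMIT = 80  # DHCPv4 Rapid Commit option code (zero-length option)


def respond_to_discover(client_options, server_allows):
    """Pick the server's reply to a DHCPDISCOVER.

    Returns the reply's message name and the options it must carry.
    """
    if RAPID_COMMIT in client_options and server_allows:
        # Two-message exchange: commit the address immediately.
        return "DHCPACK", {RAPID_COMMIT: b""}
    # Otherwise fall back to the ordinary four-message exchange.
    return "DHCPOFFER", {}


print(respond_to_discover({RAPID_COMMIT: b""}, server_allows=True))
print(respond_to_discover({RAPID_COMMIT: b""}, server_allows=False))
print(respond_to_discover({}, server_allows=True))
```

Only the first case shortens the exchange; in the other two the client proceeds with DHCPREQUEST as usual.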

6.2.10 Location Information (LCI and LoST) 

In some cases, it is useful for a host being configured to become aware of its
location in the world. Such information may be encoded using, for example, latitude,
longitude, and altitude. An IETF effort known as Geoconf ("Geographic
configuration") resulted in [RFC6225], which specifies how to provide such geospatial
Location Configuration Information (LCI) to clients using the GeoConf (123) and
GeoLoc (144) DHCP options. Geospatial LCI includes not only the value of the
latitude, longitude, and altitude coordinates, but also resolution indicators for each.
LCI can be used for a number of purposes, including emergency services. If a
caller using an IP phone requests emergency assistance, LCI can be used to
indicate where the emergency is taking place.

Although the physical location information just mentioned is useful to locate
a particular individual or system, sometimes it is important to know the civic
location of an entity. The civic location expresses location in terms of
geopolitical institutions such as country, city, district, street, and other such parameters.
Civic location information can be provided using DHCP in the same way a
physical location can, using the same LCI structure as is used with geospatial LCI.
[RFC4776] defines the GEOCONF_CIVIC (99) option for carrying civic location
LCI. This form of LCI is trickier than the geospatial information because the
geopolitical method for naming locations varies by country. An additional complexity
arises because such names may also require languages and character sets beyond
the English and ASCII language and characters ordinarily used with DHCP. There
is also a concern regarding the privacy of location in general, not just with respect
to DHCP. The IETF is addressing this issue in a framework called "Geopriv." See,
for example, [RFC3693] for more information.

An alternative high-layer protocol known as the HTTP-Enabled Location
Delivery (HELD) protocol [RFC5985] may also be used to provide location information.
Instead of encoding the LCI directly in DHCP messages, DHCP options
OPTION_V4_ACCESS_DOMAIN (213) and OPTION_V6_ACCESS_DOMAIN (57) provide
the FQDN of a HELD server for IPv4 and IPv6, respectively [RFC5986].

Once a host knows its location, it may need to contact services associated with
the location (e.g., the location of the nearest hospital). The IETF Location-to-Service
Translation (LoST) framework [RFC5222] accomplishes this using an
application-layer protocol accessed using a location-dependent URL. The DHCP options
OPTION_V4_LOST (137) and OPTION_V6_LOST (51) provide for variable-length
encodings of an FQDN specifying the name of a LoST server for DHCPv4 and
DHCPv6, respectively [RFC5223]. The encoding is in the same format used by
DNS for encoding domain names (see Chapter 11).
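
The DNS name encoding mentioned here is a sequence of length-prefixed labels ending in a zero byte. The sketch below encodes and decodes such names without compression; the function names are ours, and lost.example.com is a made-up server name. The same encoding appears in the Domain Search List option of Figure 6-19, whose value 04686f6d6500 decodes to home.

```python
def encode_name(fqdn):
    """Encode a domain name as length-prefixed labels plus a zero byte."""
    encoded = b""
    for label in fqdn.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError("each label must be 1 to 63 bytes long")
        encoded += bytes([len(raw)]) + raw
    return encoded + b"\x00"


def decode_name(data):
    """Decode an uncompressed name back into dotted form."""
    labels = []
    offset = 0
    while data[offset] != 0:
        length = data[offset]
        labels.append(data[offset + 1 : offset + 1 + length].decode("ascii"))
        offset += 1 + length
    return ".".join(labels)


print(encode_name("lost.example.com").hex())
print(decode_name(bytes.fromhex("04686f6d6500")))
```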

6.2.11 Mobility and Handoff Information (MoS and ANDSF) 

In response to the increased use of mobile computers and smartphones accessing
the Internet with cellular technology, frameworks and related DHCP options have
been specified to convey information about the cellular configuration and
handovers between different wireless networks. At present, there are two sets of DHCP
options relating to this information: IEEE 802.21 Mobility Services (MoS) Discovery
and Access Network Discovery and Selection Function (ANDSF). The latter framework
is being standardized by the 3rd Generation Partnership Project (3GPP), one of the
organizations responsible for creating cellular data communications standards.

The IEEE 802.21 standard [802.21-2008] specifies a framework for
media-independent handoff (MIH) services between various network types, including
those defined by IEEE (802.3, 802.11, 802.16), those defined by 3GPP, and those
defined by 3GPP2. A design of such a framework in the IETF context is provided
in [RFC5677]. MoS provides three types of services known as information
services, command services, and event services. Roughly speaking, these services
provide information about available networks, functions for controlling link
parameters, and notification of link status changes. The MoS Discovery DHCP
options [RFC5678] provide a means for a mobile node to acquire the addresses or
domain names of servers providing each of these services using either DHCPv4
or DHCPv6. For IPv4, the OPTION-IPv4_Address-MoS option (139) contains a
vector of suboptions containing IP addresses for servers providing each of the
services. A suboption of the OPTION-IPv4_FQDN-MoS option (140) provides a
vector of FQDNs for servers for each of the services. Similar options,
OPTION-IPv6_Address-MoS (54) and OPTION-IPv6_FQDN-MoS (55), provide equivalent
capabilities for IPv6.

Based upon 3GPP's ANDSF specification, [RFC6153] defines DHCPv4 and
DHCPv6 options for carrying ANDSF information. In particular, it defines options
for mobile devices to discover the address of an ANDSF server. ANDSF servers 
are configured by cellular infrastructure operators and may hold information 
such as the availability and access policies of multiple transport networks (e.g., 
simultaneous use of 3G and Wi-Fi). 

The ANDSF IPv4 Address Option (142) contains a vector of IPv4 addresses for 
ANDSF servers. The addresses are provided in preference order (first is most
preferred). The ANDSF IPv6 Address Option (143) contains a vector of IPv6 addresses
for ANDSF servers. To request ANDSF information using DHCPv4, the mobile node
includes an ANDSF IPv4 Address option in the Parameter Request List. To request
ANDSF information using DHCPv6, the client includes an ANDSF IPv6 Address
option in the Option Request Option (ORO) (see Section 22.7 of [RFC3315]).

6.2.12 DHCP Snooping

DHCP "snooping" is a capability that some switch vendors offer in their products
that inspects the contents of DHCP messages and ensures that only those
addresses listed on an access control list are able to exchange DHCP traffic. This
can help to protect against two potential problems. First, a "rogue" DHCP server is
limited in the damage it can do because other hosts are not able to hear its DHCP
address offers. Second, the technique can limit the allocation of addresses to a
particular set of MAC addresses. Because MAC addresses can be changed in a system
fairly easily using operating system commands, however, this second measure offers
only limited protection.


6.3 Stateless Address Autoconfiguration (SLAAC) 

While most routers have their addresses configured manually, hosts can be 
assigned addresses manually, using an assignment protocol like DHCP, or
automatically using some sort of algorithm. There are two forms of automatic
assignment, depending on what type of address is being formed. For addresses that are
to be used only on a single link (link-local addresses), a host need only find some
appropriate address not already in use on the link. For addresses that are to be
used for global connectivity, however, some portion of the address must generally
be managed. There are mechanisms in both IPv4 and IPv6 for link-local address
autoconfiguration, whereby a host determines its address(es) largely without help.
This is called stateless address autoconfiguration (SLAAC).

6.3.1 Dynamic Configuration of IPv4 Link-Local Addresses

In cases where a host without a manually configured address attaches to a network 
lacking a DHCP server, IP-based communication is unable to take place unless 
the host somehow generates an IP address to use. [RFC3927] describes a mechanism
whereby a host can automatically generate its own IPv4 address from the
link-local range 169.254.1.1 through 169.254.254.254 using the 16-bit subnet mask 
255.255.0.0 (see [RFC5735]). This method is known as dynamic link-local address 
configuration or Automatic Private IP Addressing (APIPA). In essence, a host selects 
a random address in the range to use and checks to see if that address is already 
in use by some other system on the subnetwork. This check is implemented using 
IPv4 ACD (see Chapter 4). 
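
The selection step can be sketched as follows. This is only an illustration under simplifying assumptions: the set in_use stands in for the result of the ACD probe, which a real host performs on the wire, and the function name and retry limit are hypothetical. The range endpoints are those given in the text above.

```python
import ipaddress
import random

# Range endpoints as stated in the text above.
FIRST = int(ipaddress.IPv4Address("169.254.1.1"))
LAST = int(ipaddress.IPv4Address("169.254.254.254"))


def pick_link_local(rng, in_use, max_tries=10):
    """Pick a random link-local candidate that is not known to be in use.

    in_use is a stand-in for the ACD check: a set of addresses that
    probing would have found to be taken by other systems.
    """
    for _ in range(max_tries):
        candidate = ipaddress.IPv4Address(rng.randint(FIRST, LAST))
        if candidate not in in_use:
            return candidate
    raise RuntimeError("no free link-local address found")


rng = random.Random()  # a real host might seed this from its MAC address
print(pick_link_local(rng, in_use=set()))
```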

6.3.2 IPv6 SLAAC for Link-Local Addresses 

The goal of IPv6 SLAAC is to allow nodes to automatically (and autonomously) 
self-assign link-local IPv6 addresses. IPv6 SLAAC is described in [RFC4862]. It
involves three major steps: obtaining a link-local address, obtaining a global
address using stateless autoconfiguration, and detecting whether the link-local
address is already in use on the link. Stateless autoconfiguration can be used
without routers, in which case only link-local addresses are assigned. When routers are
present, a global address is formed using a combination of the prefix advertised
by a router and locally generated information. SLAAC can also be used in
conjunction with DHCPv6 (or manual address assignment) to allow a host to obtain
information in addition to its address (called "stateless" DHCPv6). Hosts that
perform SLAAC can be used on the same network as those configured using stateful
or stateless DHCPv6. Generally, stateful DHCPv6 is used when finer control is
required in assigning addresses to hosts, but it is expected that stateless DHCPv6 in
combination with SLAAC will be the most common deployment option.

In IPv6, tentative (or optimistic) link-local addresses are selected using
procedures specified in [RFC4291] and [RFC4941]. They apply only to multicast-capable
networks and are assigned infinite preferred and valid lifetimes once established.
To form the numeric address, a unique number is appended to the well-known
link-local prefix fe80::0 (of appropriate length). This is accomplished by setting
the right-most N bits of the address to be equal to the (N-bit-long) number, the
left-most bits equal to the 10-bit link-local prefix 1111111010, and the rest to 0. The
resulting address is placed into the tentative (or optimistic) state and checked for
duplicates (see the next section).
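
The bit manipulation just described can be sketched with Python's standard ipaddress module. The function name is hypothetical, and N defaults to 64 here only because that is the interface identifier length used in the example later in this section.

```python
import ipaddress

LINK_LOCAL_PREFIX = 0b1111111010  # the 10-bit link-local prefix


def link_local_from_number(number: int, n_bits: int = 64) -> ipaddress.IPv6Address:
    """Form a link-local address from an N-bit number.

    The left-most 10 bits hold the link-local prefix, the right-most
    N bits hold the number, and all bits in between are zero.
    """
    if not 0 < n_bits <= 128 - 10:
        raise ValueError("N must leave room for the 10-bit prefix")
    if not 0 <= number < (1 << n_bits):
        raise ValueError("number does not fit in N bits")
    return ipaddress.IPv6Address((LINK_LOCAL_PREFIX << (128 - 10)) | number)


# The 64-bit number used by the host in the example later in this section.
print(link_local_from_number(0xFD26DE935AB7405A))  # fe80::fd26:de93:5ab7:405a
```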

6.3.2.1 IPv6 Duplicate Address Detection (DAD) 

IPv6 DAD uses ICMPv6 Neighbor Solicitation and Neighbor Advertisement messages
(see Chapter 8) to determine if a particular (tentative or optimistic) IPv6
address is already in use on the attached link. For purposes of this discussion,
we refer only to tentative addresses, but it is understood that DAD applies to
optimistic addresses as well. DAD is specified in [RFC4862] and is recommended to
be used every time an IPv6 address is assigned to an interface manually, using
autoconfiguration, or using DHCPv6. If a duplicate address is discovered, the
procedure causes the tentative address to not be used. If DAD succeeds, the tentative
address transitions to the preferred state and can be used without restriction.

DAD is performed as follows: A node first joins the All Nodes multicast address
and the Solicited-Node multicast address of the tentative address (see Chapter 9).
To check for a duplicate address, a node sends one or more ICMPv6 Neighbor
Solicitation messages. The source and destination IPv6 addresses of these messages
are the unspecified address and Solicited-Node address of the target address
being checked, respectively. The Target Address field is set to the address being
checked (the tentative address). If a Neighbor Advertisement message is received
in response, DAD has failed, and the address being checked is abandoned.

Note

As a consequence of joining multicast groups, MLD messages are sent (see 
Chapter 9), but their transmission is delayed by a random interval according to 
[RFC4862] to avoid congesting the network when many nodes simultaneously 
join the All Hosts group (e.g., after a restoration of power). For DAD, these MLD 
messages are used to inform MLD-snooping switches to forward multicast traffic 
as necessary. 


When an address has not yet successfully completed DAD, any received 
neighbor solicitations for it are treated in a special way, as this is indicative of 
some other host's intention to use the same address. If such messages are received, 
they are dropped, the current tentative address is abandoned, and DAD fails. 

If DAD fails, by receiving a similar neighbor solicitation from another node 
or a neighbor advertisement for the target address, the address is not assigned to 
an interface and does not become a preferred address. If the address is a link-local 
address being configured based on an interface identifier derived from a local 
MAC address, it is unlikely that the same procedure will ultimately produce a 
nonconflicting address, so the use of this address is abandoned and administrator 
input is required. If the address is based on a different form of interface
identifier, IPv6 operations may be retried using an alternative tentative address.
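
An interface identifier derived from a MAC address, as mentioned above, is commonly formed using the modified EUI-64 procedure: the universal/local bit of the first byte is inverted and the bytes ff:fe are inserted in the middle. The sketch below assumes that procedure and uses hypothetical function names. The MAC address is the router's address as it appears in Figure 6-25.

```python
import ipaddress


def eui64_interface_id(mac: str) -> int:
    """Derive a 64-bit interface identifier from a 48-bit MAC address."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02  # invert the universal/local bit
    eui = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return int.from_bytes(eui, "big")


def link_local_from_mac(mac: str) -> ipaddress.IPv6Address:
    """Combine the fe80::/64 prefix with the MAC-derived identifier."""
    return ipaddress.IPv6Address((0xFE80 << 112) | eui64_interface_id(mac))


print(link_local_from_mac("00:04:5a:9f:9e:80"))  # fe80::204:5aff:fe9f:9e80
```

Because the result depends only on the MAC address, repeating the procedure after a DAD failure yields the same address, which is why administrator input is required in that case.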

6.3.2.2 IPv6 SLAAC for Global Addresses

Once a node has acquired a link-local address, it is likely to require one or more 
global addresses as well. Global addresses are formed using a process similar to 
that for link-local SLAAC but using a prefix provided by a router. Such prefixes 
are carried in the Prefix option of a router advertisement (see Chapter 8), and a 
flag indicates whether the prefix should be used in forming global addresses with 
SLAAC. If so, the prefix is combined with an interface identifier (e.g., the same one 
used in forming a link-local address if the privacy extension is not being used) to 
form a global address. The preferred and valid lifetimes of such addresses are also 
determined by information present in the Prefix option. 
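
A minimal sketch of this combination step follows, assuming a 64-bit prefix and reusing the interface identifier from the link-local address. The function name is hypothetical. The addresses are the ones that appear in the example in the next section.

```python
import ipaddress

IID_MASK = (1 << 64) - 1  # low-order 64 bits hold the interface identifier


def slaac_global(prefix: str, interface_id: int) -> ipaddress.IPv6Address:
    """Combine an advertised /64 prefix with a 64-bit interface identifier."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")
    if not 0 <= interface_id <= IID_MASK:
        raise ValueError("interface identifier must fit in 64 bits")
    return ipaddress.IPv6Address(int(net.network_address) | interface_id)


# Reuse the interface identifier of an existing link-local address.
link_local = ipaddress.IPv6Address("fe80::fd26:de93:5ab7:405a")
iid = int(link_local) & IID_MASK
print(slaac_global("2001:db8::/64", iid))  # 2001:db8::fd26:de93:5ab7:405a
```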

6.3.2.3 Example 

The trace in Figure 6-23 shows the series of events an IPv6 (Windows Vista/SP1)
host uses when allocating its addresses with SLAAC. The system first selects a
link-local address based on the link-local prefix of fe80::/64 and a random number.
This method is designed to enhance the privacy of a user by making the address of
the host system change over time [RFC4941]. The other common method involves
using the bits of the MAC address in forming the link-local address. The system
performs DAD on this address (fe80::fd26:de93:5ab7:405a) to look for conflicts.

Frame 1: 78 bytes on wire (624 bits), 78 bytes captured (624 bits)
Ethernet II, Src: 00:13:02:20:b9:18 (00:13:02:20:b9:18), Dst: 33:33:ff:b7:40:5a (33:33:ff:b7:40:5a)
Internet Protocol Version 6, Src: :: (::), Dst: ff02::1:ffb7:405a (ff02::1:ffb7:405a)
Internet Control Message Protocol v6
    Type: 135 (Neighbor solicitation)
    Code: 0
    Reserved: 0 (Should always be zero)
    Target: fe80::fd26:de93:5ab7:405a (fe80::fd26:de93:5ab7:405a)

Figure 6-23 During SLAAC, a host begins by performing DAD on the tentative link-local
address it wishes to use by sending an ICMPv6 Neighbor Solicitation message for this
address from the unspecified address.

Figure 6-23 shows the operation of DAD, which involves the host sending an
NS to see if its selected link-local address is in use. It then quickly sends an RS
to determine how to proceed (see Figure 6-24).

Frame 2: 70 bytes on wire (560 bits), 70 bytes captured (560 bits)
Ethernet II, Src: 00:13:02:20:b9:18 (00:13:02:20:b9:18), Dst: 33:33:00:00:00:02 (33:33:00:00:00:02)
Internet Protocol Version 6, Src: fe80::fd26:de93:5ab7:405a (fe80::fd26:de93:5ab7:405a), Dst: ff02::2 (ff02::2)
Internet Control Message Protocol v6
    Type: 133 (Router solicitation)
    Code: 0
    Checksum: 0x4a16 [correct]
    ICMPv6 Option (Source link-layer address)
        Type: Source link-layer address (1)
        Length: 8
        Link-layer address: 00:13:02:20:b9:18

Figure 6-24 The ICMPv6 RS message induces a nearby router to supply configuration information such as the
global network prefix in use on the attached network.


The Router Solicitation message shown in Figure 6-24 is sent to the All Routers
multicast address (ff02::2) using the autoconfigured link-local IPv6 address as
a source address. The response is given in an RA sent to the All Systems multicast
address (ff02::1), so that all attached systems can see it (see Figure 6-25).

The RA shown in Figure 6-25 is sent from fe80::204:5aff:fe9f:9e80, the link-local
address of the router, to the All Systems multicast address ff02::1. The Flags
field in the RA, which may contain several configuration options and extensions
[RFC5175], is set to 0, indicating that addresses are not "managed" on this link
by DHCPv6. The Prefix option indicates that the global prefix 2001:db8::/64 is
in use on the link. The prefix length of 64 is carried in the option's Prefix
Length field and matches the interface identifier length defined in [RFC4291].
The Flags field value of 0xc0 associated with the Prefix option indicates that
the prefix is on-link (can be used in conjunction with a router) and the auto
flag is set, meaning that the prefix can be used by the host to configure other
addresses automatically. The RA also includes the Recursive DNS
Server (RDNSS) option [RFC6106], which indicates that a DNS server is available
at the address 2001:db8::1. The SLLAO indicates that the router's MAC address is
00:04:5a:9f:9e:80. This information is made available for any node to populate its 
neighbor cache (the IPv6 equivalent of the IPv4 ARP cache; Neighbor Discovery is 
discussed in Chapter 8).

Frame 3: 134 bytes on wire (1072 bits), 134 bytes captured (1072 bits)
Ethernet II, Src: 00:04:5a:9f:9e:80 (00:04:5a:9f:9e:80), Dst: 33:33:00:00:00:01 (33:33:00:00:00:01)
Internet Protocol Version 6, Src: fe80::204:5aff:fe9f:9e80 (fe80::204:5aff:fe9f:9e80), Dst: ff02::1 (ff02::1)
Internet Control Message Protocol v6
    Type: 134 (Router advertisement)
    Code: 0
    Checksum: 0x079c [correct]
    Cur hop limit: 64
    Flags: 0x00
        0... .... = Not managed
        .0.. .... = Not other
        ..0. .... = Not Home Agent
        ...0 0... = Router preference: Medium
        .... .0.. = Not Proxied
    Router lifetime: 1000
    Reachable time: 1000
    Retrans timer: 0
    ICMPv6 Option (Prefix information)
        Type: Prefix information (3)
        Length: 32
        Prefix Length: 64
        Flags: 0xc0
            1... .... = On-link flag (L): set
            .1.. .... = Autonomous address-configuration flag (A): set
            ..00 0000 = Reserved: 0
        Valid lifetime: infinity
        Preferred lifetime: infinity
        Reserved
        Prefix: 2001:db8::
    ICMPv6 Option (Recursive DNS server)
        Type: Recursive DNS server (25)
        Length: 24
        Reserved
        Lifetime: infinity
        Recursive DNS Servers: 2001:db8::1 (2001:db8::1)
    ICMPv6 Option (Source link-layer address)
        Type: Source link-layer address (1)
        Length: 8
        Link-layer address: 00:04:5a:9f:9e:80

Figure 6-25 An ICMPv6 RA message provides the location and availability of a default router plus the global
address prefix in use on the network. It also includes the location of a DNS server and indicates
whether the router sending the advertisement can also act as a Mobile IPv6 home agent (no in this
case). The client may use some or all of this information in configuring its operation.
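
The two flag bytes in this RA can be decoded with simple bit masks. The sketch below follows the bit positions shown in the trace for Figure 6-25. The function names and dictionary keys are our own, chosen for illustration.

```python
def decode_ra_flags(flags: int) -> dict:
    """Decode the Flags byte of a Router Advertisement."""
    return {
        "managed": bool(flags & 0x80),     # M: addresses managed by DHCPv6
        "other": bool(flags & 0x40),       # O: other configuration via DHCPv6
        "home_agent": bool(flags & 0x20),  # H: router is a Mobile IPv6 home agent
        "preference": (flags >> 3) & 0x3,  # 2-bit router preference (0 = medium)
        "proxied": bool(flags & 0x04),     # P: proxied
    }


def decode_prefix_flags(flags: int) -> dict:
    """Decode the Flags byte of a Prefix Information option."""
    return {
        "on_link": bool(flags & 0x80),     # L: prefix is on-link
        "autonomous": bool(flags & 0x40),  # A: prefix may be used for SLAAC
        "reserved": flags & 0x3F,
    }


print(decode_ra_flags(0x00))
print(decode_prefix_flags(0xC0))
```

A host would form a global address from the prefix only when the autonomous flag is set, as it is in this example.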


After an exchange of Neighbor Solicitation and Neighbor Advertisement messages
between the client and the router, the client performs another DAD operation
on the new (global) address it selects (see Figure 6-26).

The address 2001:db8::fd26:de93:5ab7:405a has been chosen by the client
based on the prefix 2001:db8::/64 carried in the router advertisement it received
earlier. The low-order bits of this address are based on the same random number as
was used to configure its link-local address. As such, the Solicited-Node multicast
address ff02::1:ffb7:405a is the same for DAD for both addresses. After this address
has been tested for duplication, the client allocates another address and applies 
DAD to it (see Figure 6-27).

Frame 6: 78 bytes on wire (624 bits), 78 bytes captured (624 bits)
Ethernet II, Src: 00:13:02:20:b9:18 (00:13:02:20:b9:18), Dst: 33:33:ff:b7:40:5a (33:33:ff:b7:40:5a)
Internet Protocol Version 6, Src: :: (::), Dst: ff02::1:ffb7:405a (ff02::1:ffb7:405a)
Internet Control Message Protocol v6
    Code: 0
    Reserved: 0 (Should always be zero)
    Target: 2001:db8::fd26:de93:5ab7:405a (2001:db8::fd26:de93:5ab7:405a)

Figure 6-26 DAD for the global address derived from the prefix 2001:db8::/64 is sent to the same
Solicited-Node multicast address as the first packet.

Frame 7: 78 bytes on wire (624 bits), 78 bytes captured (624 bits)
Ethernet II, Src: 00:13:02:20:b9:18 (00:13:02:20:b9:18), Dst: 33:33:ff:6d:5c:97 (33:33:ff:6d:5c:97)
Internet Protocol Version 6, Src: :: (::), Dst: ff02::1:ff6d:5c97 (ff02::1:ff6d:5c97)
Internet Control Message Protocol v6
    Type: 135 (Neighbor solicitation)
    Code: 0
    Checksum: 0x7cde [correct]
    Reserved: 0 (Should always be zero)
    Target: 2001:db8::9cf4:f812:816d:5c97 (2001:db8::9cf4:f812:816d:5c97)

Figure 6-27 DAD for the address 2001:db8::9cf4:f812:816d:5c97.
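
The destination addresses seen in these DAD packets can be reproduced with a short sketch. It assumes the usual construction, in which the low-order 24 bits of the target address are appended to ff02::1:ff00:0/104 and the low-order 32 bits of the resulting group address are appended to the Ethernet prefix 33:33. The function names are hypothetical.

```python
import ipaddress

SOLICITED_NODE_BASE = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node(addr: str) -> ipaddress.IPv6Address:
    """Solicited-Node multicast address for a unicast IPv6 address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    return ipaddress.IPv6Address(SOLICITED_NODE_BASE | low24)


def multicast_mac(group: ipaddress.IPv6Address) -> str:
    """Ethernet destination address used for an IPv6 multicast group."""
    low32 = (int(group) & 0xFFFFFFFF).to_bytes(4, "big")
    return "33:33:" + ":".join(f"{b:02x}" for b in low32)


for target in ("fe80::fd26:de93:5ab7:405a",
               "2001:db8::fd26:de93:5ab7:405a",
               "2001:db8::9cf4:f812:816d:5c97"):
    group = solicited_node(target)
    print(target, group, multicast_mac(group))
```

The first two targets share their low-order 24 bits and therefore map to the same group, ff02::1:ffb7:405a, while the third maps to ff02::1:ff6d:5c97, matching Figures 6-23, 6-26, and 6-27.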


The DAD operation in Figure 6-27 is for the address 2001:db8::9cf4:f812:816d:5c97.
This address is a temporary IPv6 address, generated using a different random
number for its lower-order bits for privacy reasons. The difference between
the two global addresses here is that the temporary address has a shorter lifetime.
Lifetimes are computed as the lower (smaller) of the following two values: the
lifetimes included in the Prefix Information option received in the RA and a local pair
of defaults. In the case of Windows Vista, the default valid lifetime is one week and
the default preferred lifetime is one day. Once this message has completed, the
client has performed SLAAC for its link-local address, plus two global addresses.
This is enough addressing information to perform local or global communication.
The temporary address will change periodically to help enhance privacy. In cases
where privacy protection is not desired, the following command can be employed
to disable this feature in Windows:


C:\> netsh interface ipv6 set privacy state=disabled 

In Linux, temporary addresses can be enabled using this set of commands:

Linux# sysctl -w net.ipv6.conf.all.use_tempaddr=2 
Linux# sysctl -w net.ipv6.conf.default.use_tempaddr=2 

and disabled using these commands:

Linux# sysctl -w net.ipv6.conf.all.use_tempaddr=0 
Linux# sysctl -w net.ipv6.conf.default.use_tempaddr=0 
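
The lifetime rule for temporary addresses described earlier in this example amounts to taking a pair of minimums. The sketch below uses the Windows Vista defaults quoted above and assumes that "infinity" is carried in the RA as the all-ones 32-bit value. The names are hypothetical.

```python
INFINITY = 0xFFFFFFFF      # assumed on-the-wire value for an infinite lifetime
ONE_DAY = 24 * 60 * 60     # seconds
ONE_WEEK = 7 * ONE_DAY


def temporary_lifetimes(ra_valid, ra_preferred,
                        local_valid=ONE_WEEK, local_preferred=ONE_DAY):
    """Return (valid, preferred) lifetimes in seconds for a temporary address.

    Each lifetime is the smaller of the value advertised in the Prefix
    Information option and the corresponding local default.
    """
    return min(ra_valid, local_valid), min(ra_preferred, local_preferred)


# The RA in Figure 6-25 advertises infinite lifetimes.
print(temporary_lifetimes(INFINITY, INFINITY))  # (604800, 86400)
```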


6.3.2.4 Stateless DHCP 

We have mentioned that DHCPv6 can be used in a "stateless" mode where the
DHCPv6 server does not assign addresses (or keep any per-client state) but
does provide other configuration information. Stateless DHCPv6 is specified in
[RFC3736] and combines SLAAC with DHCPv6. It is believed that this combination
is an attractive deployment option because network administrators need
not be directly concerned with address pools as they have been when deploying
DHCPv4.

In a stateless DHCPv6 deployment, nodes are assumed to have obtained their
addresses using some method other than DHCPv6. Thus, the DHCPv6 server does
not need to handle any of the address management messages specified in Table
6-1. In addition, it does not need to handle any of the options required for
establishing IA bindings. This simplifies the server software and server configuration
considerably. The operation of relay agents is unchanged.

Stateless DHCPv6 clients use the DHCPv6 INFORMATION-REQUEST message
to request information that is provided in REPLY messages from servers. The
INFORMATION-REQUEST message includes an Option Request option listing
the options about which the client wishes to know more. The INFORMATION-REQUEST
may include a Client Identifier option, which allows answers to be
customized for particular clients.
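
To make the message structure concrete, the following sketch assembles a minimal INFORMATION-REQUEST. The numeric codes are not given in this section and come from the DHCPv6 specifications (message type 11, Client Identifier 1, Option Request 6, DNS Recursive Name Server 23, Domain Search List 24). The function names are hypothetical, and the message omits other options a real client would normally send, such as Elapsed Time.

```python
import struct

INFORMATION_REQUEST = 11   # DHCPv6 message type
OPTION_CLIENTID = 1
OPTION_ORO = 6             # Option Request option
OPTION_DNS_SERVERS = 23
OPTION_DOMAIN_LIST = 24


def dhcpv6_option(code: int, data: bytes) -> bytes:
    """Encode one option: 16-bit code, 16-bit length, then the data."""
    return struct.pack("!HH", code, len(data)) + data


def build_information_request(xid: int, requested, client_duid=None) -> bytes:
    """Build an INFORMATION-REQUEST with an ORO and an optional Client ID."""
    if not 0 <= xid < (1 << 24):
        raise ValueError("transaction ID must fit in 24 bits")
    msg = bytes([INFORMATION_REQUEST]) + xid.to_bytes(3, "big")
    if client_duid is not None:
        msg += dhcpv6_option(OPTION_CLIENTID, client_duid)
    oro = b"".join(struct.pack("!H", code) for code in requested)
    msg += dhcpv6_option(OPTION_ORO, oro)
    return msg


msg = build_information_request(0x123456, [OPTION_DNS_SERVERS, OPTION_DOMAIN_LIST])
print(msg.hex())  # 0b1234560006000400170018
```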

To be a compliant stateless DHCPv6 server, a system must implement the
following messages: INFORMATION-REQUEST, REPLY, RELAY-FORW, and RELAY-REPL.
It also must implement the following options: Option Request, Status Code,
Server Identifier, Client Message, Server Message, and Interface-ID. The last three
are used when relay agents are involved. To be a useful stateless DHCPv6 server, 
several other options will likely be necessary: DNS Server, DNS Search List, and 
possibly SIP Servers. Other potentially useful, but not required, options include 
Preference, Elapsed Time, User Class, Vendor Class, Vendor-Specific Information, 
Client Identifier, and Authentication. 

6.3.2.5 The Utility of Address Autoconfiguration 

The utility of address autoconfiguration for IP is typically limited because routers 
that may be on the same network as the client are configured with particular IP 
address ranges in use that differ from the addresses a client is likely to autoconfigure.
This is especially true for the IPv4 (APIPA) case, as the private link-local prefix
169.254/16 is very unlikely to be used by a router. Therefore, the consequence of 
self-assigning an IP address is that local subnet access may work, but Internet 
routing and name services (DNS) are likely to fail. When DNS fails, much of the 
common Internet "experience" fails with it. Thus, it is often more useful to have a 
client fail to get an IP address (which is relatively easily detected) than to allow it 
to obtain one that cannot really be used effectively. 


Note 

There are name services other than conventional DNS that may be of use for 
link-local addressing, including Bonjour/ZeroConf (Apple), LLMNR, and NetBIOS 
(Microsoft). Because these have evolved over time from different vendors, and 
are not established IETF standards, the exact behavior involved when mapping 
names to addresses in the local environment varies considerably. See Chapter 11 
for more details on local alternatives to DNS. 


The use of APIPA can be disabled, which prevents a system from self-assigning
an IP address. In Windows, this is accomplished by creating the following
registry key (the key is a single line but is wrapped here for illustration):


HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\ 
IPAutoconfigurationEnabled 


This REG_DWORD value may be set to 0 to disable APIPA for all network
interfaces. In Linux, the file /etc/sysconfig/network can be modified to include
the following directive: 


NOZEROCONF=Yes

This disables the use of APIPA for all network interfaces. It is also possible to
disable APIPA for specific interfaces by modifying the per-interface configuration
files (e.g., /etc/sysconfig/network-scripts/ifcfg-eth0 for the first
Ethernet device).

In the case of IPv6 SLAAC, it is relatively easy to obtain a global IPv6 address,
but the relationship between a name and its address is not secured, leading to a
potential set of unpleasant consequences (see Chapters 11 and 18). Thus, it may
still be desirable to avoid SLAAC in deployments for the time being. To disable
SLAAC for IPv6 global addresses, there are two methods. First, the Router
Advertisement messages provided by the local router can be arranged to turn off the
"auto" flag in the Prefix option (or configure it to not provide a Prefix option, as
illustrated in the preceding example). In addition, a local configuration setting
causes a client to avoid autoconfiguration of global addresses.

To disable SLAAC in a Linux client, the following command may be given:


Linux# sysctl -w net.ipv6.conf.all.autoconf=0 


To do so on a Mac OS or FreeBSD system, at least for link-local addresses, the
following command should be used:


FreeBSD# sysctl -w net.inet6.ip6.auto_linklocal=0 


And, finally, for Windows: 


C:\> netsh 

netsh> interface ipv6 

netsh interface ipv6> set interface {ifname} managedaddress=disabled 


where {ifname} should be replaced with the appropriate interface name (in this
example, "Wireless Network Connection"). Note that the behavior of these
configuration commands sometimes changes over time. Please check the operating
system documentation for the current method if these commands do not perform
as expected.


6.4 DHCP and DNS Interaction 

One of the important parts of the configuration information a DHCP client
typically receives when obtaining an IP address is the IP address of a DNS server. This
allows the client system to convert DNS names to the IPv4 and/or IPv6 addresses
required by the protocol implementation to make transport-layer connections.
Without a DNS server or other way to map names to addresses, most users would
find the system nearly useless for accessing the Internet. If the local DNS is
working properly, it should be able to provide address mappings for the Internet as a
whole, but also for local private networks (like .home mentioned earlier), if
properly configured.

Because DNS mappings for local private networks are cumbersome to manage
by hand, it is convenient to couple the act of providing a DHCP-assigned address 
with a method for updating the DNS mappings corresponding to that address. 
This can be done either using a combined DHCP/DNS server or with dynamic DNS 
(see Chapter 11). 

A combined DNS/DHCP server (such as the Linux dnsmasq package) is a 
server program that can be configured to give out IP address leases and other 
information but that also reads the Client Identifier or Domain Name present in a 
DHCPREQUEST and updates an internal DNS database with the name-to-address 
binding before responding with the DHCPACK. In doing so, any subsequent DNS 
requests initiated either by the DHCP client or by other systems interacting with 
the same DNS server are able to convert between the name of the client and its 
freshly assigned IP address. 
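
The coupling between lease assignment and name registration can be sketched in a few lines. This toy model is not how dnsmasq or any other real server is implemented; the class, its methods, the address pool, and the host name are all hypothetical, and lease expiry is ignored entirely.

```python
import ipaddress


class LeaseAndNameServer:
    """Toy combined DHCP/DNS server: leases and names share one process."""

    def __init__(self, pool: str):
        self.free = list(ipaddress.ip_network(pool).hosts())
        self.leases = {}  # client identifier -> leased address
        self.names = {}   # host name -> leased address

    def handle_request(self, client_id: str, hostname: str):
        """Process a request: assign a lease, then record the name binding."""
        addr = self.leases.get(client_id)
        if addr is None:
            if not self.free:
                raise RuntimeError("address pool exhausted")
            addr = self.free.pop(0)
            self.leases[client_id] = addr
        # Update the name database before acknowledging the lease, so that
        # later lookups by any client already see the new binding.
        self.names[hostname.lower()] = addr
        return addr  # stands in for sending the DHCPACK

    def resolve(self, hostname: str):
        """Answer a name lookup from the internal database."""
        return self.names.get(hostname.lower())


server = LeaseAndNameServer("10.0.0.0/29")
leased = server.handle_request("01:00:13:02:20:b9:18", "vista.home")
print(leased, server.resolve("vista.home"))
```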


6.5 PPP over Ethernet (PPPoE) 

For most LANs and some WAN connections, DHCP provides the most common
method for configuring client systems. For WAN connections such as DSL,
another method based on PPP is often used instead. This method involves carrying
PPP on Ethernet and is called PPP over Ethernet (PPPoE). PPPoE is used in cases
where the WAN connection device (e.g., DSL modem) acts as a switch or bridge
instead of a router. PPP is preferred as a basis for establishing connectivity by
some ISPs because it may provide finer-grain configuration control and audit logs
than other configuration options such as DHCP. To provide Internet connectivity,
some device such as a user's PC must implement the IP routing and addressing
functions. Figure 6-28 shows the typical use case.



Figure 6-28 A simplified view of DSL service using PPPoE as provided to a customer. The home PC 
implements the PPPoE protocol and authenticates the subscriber with the ISP. It may 
also act as a router, DHCP server, DNS server, and/or NAT device for the home LAN.

The figure shows an ISP providing services to many customers using DSL.
DSL provides a point-to-point digital link that can operate simultaneously with a
conventional analog telephone line (called plain old telephone service or POTS). This
simultaneous use of the customer's physical phone wires is accomplished using
frequency division multiplexing: the DSL information is carried on higher
frequencies than POTS. A filter is required when attaching conventional telephone
handsets to avoid interference from the higher DSL frequencies. The DSL modem 
effectively provides a bridged service to a PPP port on the ISP's access concentrator 
(AC), which interconnects the customer's modem line and the ISP's networking 
equipment. The modem and AC also support the PPPoE protocol, which the user 
has elected in this example to configure on a home PC attached to the DSL modem 
using a point-to-point Ethernet network (i.e., an Ethernet LAN using only a single 
cable). 

Once the DSL modem has successfully established a low-layer link with the 
ISP, the PC can begin the PPPoE exchange, as defined in the informational
document [RFC2516] and shown in Figure 6-29.

Figure 6-29 The PPPoE message exchange starts in a Discovery stage and establishes a PPP Session
stage. Each message is a PAD message. PADI requests responses from PPPoE servers.
PADO offers connectivity. PADR expresses the client's selection among multiple
possible servers. PADS provides an acknowledgment to the client from the selected server.
After the PAD exchanges, a PPP session begins. The PPP session can be terminated by
either side sending a PADT message or when the underlying link fails or is shut down.

The protocol includes a Discovery phase and a PPP Session phase. The Discovery
phase involves the exchange of several PPPoE Active Discovery (PAD) messages:
PADI (Initiation), PADO (Offer), PADR (Request), PADS (Session Confirmation).
Once the exchange is complete, an Ethernet-encapsulated PPP session proceeds
and ultimately concludes with either side sending a PADT (Termination) message.
The session also concludes if the underlying connection is broken. PPPoE messages
use the format shown in Figure 6-30 and are encapsulated in the Ethernet
payload area.


Message fields, in order (Ver, Type, and Code occupy bits 0-15; Session ID occupies bits 16-31):

    Ver (4 bits)         Set to 0x1 for this version of PPPoE
    Type (4 bits)
    Code (8 bits)
    Session ID           (16 bits, value 0 during Discovery)
    Length               (16 bits, length of payload)
    Payload (variable)   [PAD messages contain TLV tags in payload area]

PPPoE Ethernet types:
    0x8863 (Discovery)
    0x8864 (PPP Session)

Code values:
    0x09 (PADI)
    0x07 (PADO)
    0x19 (PADR)
    0x65 (PADS)
    0xA7 (PADT)
    0x00 (PPP session)

Figure 6-30 PPPoE messages are carried in the payload area of Ethernet frames. The Ethernet Type 
field is set to 0x8863 during the Discovery phase and 0x8864 when carrying PPP session 
data. For PAD messages, a TLV scheme is used for carrying configuration information, 
similar to DHCP options. The PPPoE Session ID is chosen by the server and conveyed 
in the PADS message. 


In Figure 6-30, the PPPoE Ver and Type fields are both 4 bits long and contain
the value 0x1 for the current version of PPPoE. The Code field contains an
indication of the PPPoE message type, as shown in the lower right part of
Figure 6-30. The Session ID field contains the value 0x0000 for PADI, PADO, and
PADR messages and contains a unique 16-bit number in subsequent messages. The
same value is maintained during the PPP Session phase. PAD messages contain one
or more tags, which are TLVs arranged as a 16-bit TAG_TYPE field followed by a
16-bit TAG_LENGTH field and a variable amount of tag value data. The values and
meanings of the TAG_TYPE field are given in Table 6-2.

Table 6-2 PPPoE TAG_TYPE values, names, and purposes. PAD messages may contain
one or more tags.

Value    Name                 Purpose
0x0000   End-of-List          Indicates that no further tags are present. TAG_LENGTH must be 0.
0x0101   Service-Name         Contains a UTF-8-encoded service name (for ISP use).
0x0102   AC-Name              Contains a UTF-8-encoded string identifying the access concentrator.
0x0103   Host-Uniq            Binary data used by client to match messages; not interpreted by AC.
0x0104   AC-Cookie            Binary data used by AC for DoS protection; echoed by client.
0x0105   Vendor-Specific      Not recommended; see [RFC2516] for details.
0x0110   Relay-Session-ID     May be added by a relay relaying PAD traffic.
0x0201   Service-Name-Error   The requested Service-Name tag cannot be honored by AC.
0x0202   AC-System-Error      The AC experienced an error in performing a requested action.
0x0203   Generic-Error        Contains a UTF-8 string describing an unrecoverable error.
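The header and tag layout described above can be sketched in a few lines of
Python. This is an illustrative parser, not a complete PPPoE implementation. It
assumes the 16-bit Length field that follows the Session ID and the PADS code
value 0x65, both taken from [RFC2516]. The function names and the AC name
"example-ac" are hypothetical.

```python
import struct

# TAG_TYPE values taken from Table 6-2 (subset).
TAG_NAMES = {
    0x0000: "End-of-List",
    0x0101: "Service-Name",
    0x0102: "AC-Name",
    0x0103: "Host-Uniq",
    0x0104: "AC-Cookie",
}

def build_tag(tag_type: int, value: bytes) -> bytes:
    """Encode one tag: 16-bit TAG_TYPE, 16-bit TAG_LENGTH, then the value."""
    return struct.pack("!HH", tag_type, len(value)) + value

def parse_pad_message(payload: bytes) -> dict:
    """Parse the Ethernet payload of a PPPoE Discovery (PAD) message."""
    ver_type, code, session_id, length = struct.unpack("!BBHH", payload[:6])
    body = payload[6:6 + length]
    tags, offset = [], 0
    while offset + 4 <= len(body):
        tag_type, tag_len = struct.unpack("!HH", body[offset:offset + 4])
        if tag_type == 0x0000:          # End-of-List: stop parsing
            break
        value = body[offset + 4:offset + 4 + tag_len]
        tags.append((TAG_NAMES.get(tag_type, hex(tag_type)), value))
        offset += 4 + tag_len
    return {
        "version": ver_type >> 4,       # upper 4 bits
        "type": ver_type & 0x0F,        # lower 4 bits
        "code": code,
        "session_id": session_id,
        "tags": tags,
    }

# Hypothetical PADS message: code 0x65, session ID 0xecbd, two tags.
tags = build_tag(0x0103, bytes.fromhex("9c3a0000")) + build_tag(0x0102, b"example-ac")
pads = struct.pack("!BBHH", 0x11, 0x65, 0xECBD, len(tags)) + tags
message = parse_pad_message(pads)
print(hex(message["session_id"]), message["tags"])
```

Unknown tag types are reported by their hexadecimal value rather than rejected,
which keeps the sketch tolerant of tags outside the subset listed here.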


To see PPPoE in action, we can monitor the exchange between a home system 
such as the home PC from Figure 6-28 and an access concentrator. The Discovery 
phase and first PPP session packet are shown in Figure 6-31. 

Figure 6-31 shows the expected exchange of PADI, PADO, PADR, and PADS
messages. Each contains the Host-Uniq tag with value 9c3a0000. Messages coming
from the concentrator also include the value 90084090400368-rback37.snfcca in
the AC-Name tag. The PADS message can be seen in more detail in Figure 6-32.

In Figure 6-32, the PADS message indicates the establishment of a PPP session
for the client and the use of the session ID 0xecbd. The AC-Name tag is also
maintained to indicate the originating AC. The Discovery phase is now complete,
and a regular PPP session (see Chapter 3) can commence. Figure 6-33 shows the
first PPP session packet.

The figure indicates the beginning of the PPP Session phase within the PPPoE
exchange. The PPP session begins with link configuration (PPP LCP) by the client
sending a Configuration Request (see Chapter 3). It indicates that the client
wishes to use the Password Authentication Protocol, a relatively insecure
method, for authenticating itself to the AC. Once the authentication exchange is
complete and various link parameters are exchanged (e.g., MRU), IPCP is used to
obtain and configure the assigned IP address. Note that additional configuration
information (e.g., IP addresses of the ISP's DNS servers) may need to be
obtained separately and, depending on the ISP's configuration, configured by
hand.

Figure 6-32 The PPPoE PADS message confirms the association between the client and the access concentrator.
This message also defines the session ID as 0xecbd, which is used in subsequent PPP session packets.

Figure 6-33 The first PPP message of the PPPoE session is a Configuration Request. The Ethernet type has changed
to 0x8864 to indicate an active PPP session, and the Session ID is set to 0xecbd. In this case, the PPP
client wishes to authenticate using the (relatively insecure) Password Authentication Protocol.


6.6 Attacks Involving System Configuration

A wide variety of attacks can be mounted relating to system and network configu¬ 
ration. They range from deploying unauthorized clients or unauthorized servers 
that interfere with DHCP to various forms of DoS attacks that involve resource 
exhaustion, such as requesting all possible IP addresses a server may have to give 
out. Many of these problems are widespread because the older IPv4-based proto¬ 
cols used for address configuration were designed for networks where trust was 
assumed, and the newer ones have seen little deployment to date. (Secured deploy¬ 
ments are even rarer.) Therefore, none of these attacks are directly addressed by 
typical DHCP deployments, although link-layer authentication (e.g., WPA2 as 
used with Wi-Fi networks) helps to limit the number of unauthorized clients that 
are able to attach to a particular network. 

An effort is under way within the IETF to provide security for IPv6 Neighbor 
Discovery, which, when or if it is deployed, would directly impact the security 
of operating networks using SLAAC. The trust and threat assumptions are out¬ 
lined in [RFC3756] from 2004, and the Secure Neighbor Discovery (SEND) protocol 
is defined in [RFC3971]. SEND applies IPsec (see Chapter 18) to Neighbor Discov¬ 
ery packets, in combination with cryptographically generated addresses (CGAs) 
[RFC3972]. Such addresses are derived from a keyed hash function, so they can be 
generated only by a system holding the appropriate key material. 


6.7 Summary 

A basic set of configuration information is required for a host or router to operate 
on the Internet or on a private network using Internet protocols. At a minimum, 
routers typically require the assignment of addressing information, whereas hosts 
require addresses, a next-hop router, and the location of a DNS server. DHCP is 
available for both IPv4 and IPv6, but the two are not directly interoperable.
DHCP allows appropriately configured servers to lease one or more addresses to 
requesting clients for a defined period of time. Clients renew their leases if they 
require ongoing use. DHCP can also be used by the client to acquire additional 
information, such as the subnet mask, default routers, vendor-specific configura¬ 
tion information, DNS server, home agents, and default domain name. DHCP can 
be used through relay agents when a client and server are located on different net¬ 
works. Several extensions to DHCP allow for additional information to be carried 
between a relay agent and server when this is used. DHCPv6 can also be used to 
delegate a range of IPv6 address space to a router. 

With IPv6, a host typically uses multiple addresses. An IPv6 client is able to 
generate its link-local address autonomously by combining a special link-local 
IPv6 prefix with other local information such as bits derived from one of its 
MAC addresses or from a random number to help promote privacy. To obtain 
a global address, it can obtain a global address prefix from either ICMP Router
Advertisement messages or from a DHCPv6 server. DHCPv6 servers may operate
in a "stateful" mode, in which they lease IPv6 addresses to requesting clients, or a 
"stateless" mode, in which they provide configuration information other than the 
addresses. 

PPPoE carries PPP messages over Ethernet to establish Internet connectiv¬ 
ity with ISPs, especially those ISPs that provide service using DSL. When using 
PPPoE, a user usually has a DSL modem with an Ethernet port acting as a bridge 
or switch. PPPoE first exchanges a set of Discovery messages to determine the 
identity of an access concentrator and establish a PPP session. After the Discovery
phase is successfully completed, PPP traffic, which can be encapsulated in Eth¬ 
ernet and carry various protocols such as IP, may continue until the PPPoE asso¬ 
ciation is terminated, either intentionally or as a result of disconnection of the 
underlying link. When PPPoE is used, the PPP protocol's configuration capabili¬ 
ties such as IPCP (discussed in Chapter 3) are ultimately responsible for assigning 
the IP address to the client system. 

DHCP and the ICMPv6 router advertisements used with IPv6 stateless auto¬ 
configuration are ordinarily deployed without security mechanisms being applied 
to them. Because of this, they are susceptible to a number of attacks, including net¬ 
work access by unauthorized clients, operation of rogue DHCP servers that give 
out bogus addresses and cause various forms of denial of service, and resource 
exhaustion attacks in which a client may request more addresses than are avail¬ 
able. Most of these attacks can be mitigated by security mechanisms that have 
been added to DHCP such as DHCP authentication and the relatively recent SEND 
protocol. However, these are not commonly found in operation today.


Firewalls and Network Address Translation (NAT)


7.1 Introduction 

During the early years of the Internet and its protocols, most network designers 
and developers were from universities or other entities engaged in research. These 
researchers were generally friendly and cooperative, and the Internet system was 
not especially resilient to attack, but not many people were interested in attack¬ 
ing it, either. By the late 1980s and especially the early to mid-1990s the Internet 
had gained the interest of the mass population and ultimately people interested 
in compromising its security. Successful attacks became commonplace, and many 
problems were caused by bugs or unplanned protocol operations in the software 
implementations of Internet hosts. Because some sites had a large number of end 
systems with various versions of operating system software, it became very dif¬ 
ficult for system administrators to ensure that all the various bugs in these end 
systems had been fixed. Furthermore, for obsolete systems, this task was all but 
impossible. Fixing the problem would have required a way to control the Internet 
traffic to which the end hosts were exposed. Today, this is provided by a firewall — 
a type of router that restricts the types of traffic it forwards. 

As firewalls were being deployed to protect enterprises, another problem 
was becoming important: the number of available IPv4 addresses was dimin¬ 
ishing, with a threat of exhaustion. Something would have to be done with the 
way addresses were allocated and used. One of the most important mechanisms 
developed to deal with this, aside from IPv6, is called Network Address Translation 
(NAT). With NAT, Internet addresses need not be globally unique, and as a conse¬ 
quence they can be reused in different parts of the Internet, called address realms. 
Allowing the same addresses to be reused in multiple realms greatly eased the 
problem of address exhaustion. As we shall see, NAT can also be synergistically 
combined with firewalls to produce combination devices that have become the
most popular types of routers used to connect end users, including home
networks and small enterprises, to the Internet. We shall now explore both
firewalls and NATs in further detail.


7.2 Firewalls 

Given the enormous management problems associated with trying to keep end
system software up-to-date and bug-free, the focus of resisting attacks expanded
from securing end systems to restricting the Internet traffic allowed to flow to
end systems by filtering out some traffic using firewalls. Today, firewalls are
common, and several different types have evolved.

The two major types of firewalls commonly used include proxy firewalls and
packet-filtering firewalls. The main difference between them is the layer in the
protocol stack at which they operate, and consequently the way IP addresses and
port numbers are used. The packet-filtering firewall is an Internet router that
drops datagrams that (fail to) meet specific criteria. The proxy firewall
operates as a multihomed server host from the viewpoint of an Internet client.
That is, it is the endpoint of TCP and UDP transport associations; it does not
typically route IP datagrams at the IP protocol layer.

7.2.1 Packet-Filtering Firewalls 

Packet-filtering firewalls act as Internet routers and filter (drop) some traffic. They 
can generally be configured to discard or forward packets whose headers meet 
(or fail to meet) certain criteria, called filters. Simple filters include range compari¬ 
sons on various parts of the network-layer or transport-layer headers. The most 
popular filters involve undesired IP addresses or options, types of ICMP mes¬ 
sages, and various UDP or TCP services, based on the port numbers contained in 
each packet. As we shall see, the simplest packet-filtering firewalls are stateless, 
whereas the more sophisticated ones are stateful. Stateless packet-filtering
firewalls treat each datagram individually, whereas stateful firewalls are able to
associate packets with either previously observed packets or packets that arrive in the
future to make inferences about datagrams or streams—either those belonging to 
a single transport association or those IP fragments that constitute an IP datagram 
(see Chapter 10). IP fragmentation can significantly complicate a firewall's job, and 
stateless packet-filtering firewalls are easily confused by fragments. 
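A stateless filter of the kind described above can be sketched as an ordered
list of rules checked against each packet's header fields, with the first match
deciding the outcome. The Rule class, the dictionary packet format, and the
server address 192.0.2.10 are assumptions made for illustration; real firewalls
use their own rule syntax.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    action: str                      # "forward" or "discard"
    src: str = "0.0.0.0/0"           # source prefix
    dst: str = "0.0.0.0/0"           # destination prefix
    protocol: Optional[str] = None   # "tcp", "udp", "icmp"; None matches any
    port_lo: int = 0                 # destination port range (inclusive)
    port_hi: int = 65535

    def matches(self, pkt: dict) -> bool:
        """Range comparisons on network-layer and transport-layer fields."""
        return (
            ipaddress.ip_address(pkt["src"]) in ipaddress.ip_network(self.src)
            and ipaddress.ip_address(pkt["dst"]) in ipaddress.ip_network(self.dst)
            and self.protocol in (None, pkt["protocol"])
            and self.port_lo <= pkt.get("dst_port", 0) <= self.port_hi
        )

def filter_packet(rules, pkt, default="discard"):
    """Return the action of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(pkt):
            return rule.action
    return default

# Hypothetical policy: anything from the inside network may leave, and
# outside hosts may reach only TCP port 80 on one server.
ACL = [
    Rule("forward", src="10.0.0.0/8"),
    Rule("forward", dst="192.0.2.10/32", protocol="tcp", port_lo=80, port_hi=80),
]

print(filter_packet(ACL, {"src": "203.0.113.5", "dst": "192.0.2.10",
                          "protocol": "tcp", "dst_port": 23}))
```

Because each packet is judged in isolation, a list like this cannot tell a
reply to an inside host from an unsolicited packet. Closing that gap is what
stateful filtering adds.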

A typical packet-filtering firewall is shown in Figure 7-1. Here, the firewall is 
an Internet router with three network interfaces: an "inside," an "outside," and a 
third "DMZ" interface. The DMZ subnet provides access to an extranet or DMZ 
where servers are deployed for Internet users to access. Network administrators 
install filters or access control lists (ACLs, basically policy lists indicating what 
types of packets to discard or forward) in the firewall. Typically, these filters con¬ 
servatively block traffic from the outside that may be harmful and liberally allow 
traffic to travel from inside to outside.

Figure 7-1 A typical packet-filtering firewall configuration. The firewall acts as an IP router between
an "inside" and an "outside" network, and sometimes a third "DMZ" or extranet network,
allowing only certain traffic to pass through it. A common configuration allows
all traffic to pass from inside to outside but only a small subset of traffic to pass in
the reverse direction. When a DMZ is used, only certain services are permitted to be
accessed from the Internet.


7.2.2 Proxy Firewalls 

Packet-filtering firewalls act as routers that selectively drop packets. Other types 
of firewalls, called proxy firewalls, are not really Internet routers in the true sense. 
Instead, they are essentially hosts running one or more application-layer gateways 
(ALGs)—hosts with more than one network interface that relay traffic of certain 
types between one connection/association and another at the application layer. 
They do not typically perform IP forwarding as routers do, although more sophis¬ 
ticated proxy firewalls are now available that combine various functions. 

Figure 7-2 illustrates a proxy firewall. For this type of firewall, clients on the 
inside of the firewall are usually configured in a special way to associate (or con¬ 
nect) with the proxy instead of the actual end host providing the desired service. 
(Applications capable of operating with proxy firewalls this way include con¬ 
figuration options for it.) These firewalls act as multihomed hosts, and their IP 
forwarding capability, if present, is typically disabled. As with packet-filtering 
firewalls, a common configuration is to have an "outside" interface assigned a 
globally routable IP address and for its "inner" interface to be configured with a 
private IP address. Thus, proxy firewalls support the use of private address realms.

Figure 7-2 The proxy firewall acts as a multihomed Internet host, terminating TCP connections and
UDP associations at the application layer. It does not act as a conventional IP router but
rather as an ALG. Individual applications or proxies for each service supported must be
enabled for communication to take place through the proxy firewall.


While this type of firewall can be quite secure (some people believe this type
is fundamentally more secure than packet-filtering firewalls), this security
comes at a cost of brittleness and lack of flexibility. In particular, because
this style of firewall must contain a proxy for each transport-layer service,
any new services to be used must have a corresponding proxy installed and
operated for connectivity to take place through the proxy. In addition, each
client must typically be configured to find the proxy (e.g., using the Web Proxy
Auto-Discovery Protocol, or WPAD [XIDAD], although there are some alternatives:
so-called capturing proxies that catch all traffic of a certain type regardless
of destination address). With respect to deployment, these firewalls work well
in environments where all types of network services being accessed are known
with certainty in advance, but they may require significant intervention from
network operators to support additional services.

The two most common forms of proxy firewalls are HTTP proxy firewalls
[RFC2616] and SOCKS firewalls [RFC1928]. The first type, also called Web proxies,
works only for the HTTP and HTTPS (Web) protocols, but because these protocols
are so popular, such proxies are commonly used. These proxies act as Web servers
for internal clients and as Web clients when accessing external Web sites. Such
proxies often also operate as Web caches. These caches save copies of Web pages
so that subsequent accesses can be served directly from the cache instead of
from the originating Internet Web server. Doing so can reduce latency to display
Web pages and improve the experience of users accessing the Web. Some Web
proxies are also used as content filters, which attempt to block access to
certain Web sites based on a "blacklist" of prohibited sites. Conversely, a
number of so-called tunneling proxy servers are available on the Internet. These
servers (e.g., psiphon, CGIProxy) essentially perform the opposite function: to
allow users to avoid being blocked by content filters.

The SOCKS protocol is more generic than HTTP for proxy access and is applicable
to more services than just the Web. Two versions of SOCKS are currently in use:
version 4 and version 5. Version 4 provides the basic support for proxy
traversal, and version 5 adds strong authentication, UDP traversal, and IPv6
addressing. To use a SOCKS proxy, an application must be written to use SOCKS
(it must be "socksified") and configured to know about the location of the proxy
and which version of SOCKS to use. Once this is accomplished, the client uses
the SOCKS protocol to request the proxy to perform network connections and,
optionally, DNS lookups.
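As a rough illustration of what a "socksified" client sends, the sketch below
builds the first two SOCKS version 5 client messages as byte strings. The field
layout follows [RFC1928]; the function names are hypothetical, and the sketch
only constructs the bytes without opening a connection. Passing a host name
instead of a numeric address is what asks the proxy to perform the DNS lookup.

```python
import ipaddress
import struct

SOCKS_VERSION = 5
CMD_CONNECT = 1
ATYP_IPV4, ATYP_DOMAIN, ATYP_IPV6 = 1, 3, 4

def socks5_greeting(methods=(0x00,)) -> bytes:
    """Version, number of methods, then method IDs (0x00 = no authentication)."""
    return bytes([SOCKS_VERSION, len(methods), *methods])

def socks5_connect_request(host: str, port: int) -> bytes:
    """Build a CONNECT request for an IPv4/IPv6 address or a domain name."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not a numeric address: send the name and let the proxy resolve it.
        name = host.encode("ascii")
        addr_field = bytes([ATYP_DOMAIN, len(name)]) + name
    else:
        atyp = ATYP_IPV4 if addr.version == 4 else ATYP_IPV6
        addr_field = bytes([atyp]) + addr.packed
    # Version, command, reserved byte, address, then 16-bit port.
    return bytes([SOCKS_VERSION, CMD_CONNECT, 0x00]) + addr_field + struct.pack("!H", port)

print(socks5_greeting().hex())
print(socks5_connect_request("www.isoc.org", 80).hex())
```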


7.3 Network Address Translation (NAT) 

NAT is essentially a mechanism for allowing the same sets of IP addresses to be
reused in different parts of the Internet. The primary motivation for the
creation of NAT was the limited and diminishing availability of IP address
space. The most common use case for a NAT is when a site with a single Internet
connection is assigned a small range of IP addresses (perhaps only a single
address), but there are multiple computers requiring Internet access. When all
incoming and outgoing traffic passes through a single NAT device that partitions
the inside (private) address realm from the global Internet address realm, all
the internal systems can be provided Internet connectivity as clients using
locally assigned, private IP addresses. Allowing privately addressed systems to
offer services on the Internet, however, is somewhat more complicated. We
discuss this case in Section 7.3.4.

NAT was introduced to solve two problems: address depletion and concerns
regarding the scalability of routing. At the time of its introduction (early
1990s), NAT was suggested as a stopgap, temporary measure to be used until the
deployment of some protocol with a larger number of addresses (ultimately, IPv6)
became widespread. Routing scalability was being tackled with the development
of Classless Inter-Domain Routing (CIDR; see Chapter 2). NAT is popular because
it reduces the need for globally routable Internet addresses but also because it
offers some degree of natural firewall capability and requires little
configuration. Perhaps ironically, the development and eventual widespread use
of NAT have contributed significantly to slowing the adoption of IPv6. Among its
other benefits, IPv6 was intended to make NAT unnecessary [RFC4864].

Despite its popularity, NAT has several drawbacks. The most obvious is that
offering Internet-accessible services from the private side of a NAT requires
special configuration because privately addressed systems are not directly
reachable from the Internet. In addition, for a NAT to work properly, every
packet in both directions of a connection or association must pass through the
same NAT. This is because the NAT must actively rewrite the addressing
information in each packet in order for communication between a privately
addressed system and a conventionally addressed Internet host to work. In many
ways, NATs run counter to a fundamental tenet of the Internet protocols: the
"smart edge" and "dumb middle." To do their job, NATs require connection state
on a per-association (or per-connection) basis and must operate across multiple
protocol layers, unlike conventional routers. Modifying an address at the IP
layer also requires modifying checksums at the transport layer (see Chapters 10
and 13 regarding the pseudo-header checksum to see why).
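The dependence of the transport checksum on the IP addresses can be shown with
a short sketch. It computes a TCP checksum over the pseudo-header plus segment,
then applies the incremental update of RFC 1624 (HC' = ~(~HC + ~m + m')) when
the source address is rewritten. The function names and the minimal 20-byte
header are illustrative assumptions; the addresses are borrowed from the
chapter's TCP example.

```python
import socket
import struct

def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length data
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over pseudo-header (addresses, zero, protocol 6, length) + segment."""
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 6, len(segment)))
    return ~ones_complement_sum(pseudo + segment) & 0xFFFF

def update_checksum(old_cksum: int, old_bytes: bytes, new_bytes: bytes) -> int:
    """Adjust a checksum after old_bytes were replaced by new_bytes."""
    total = ~old_cksum & 0xFFFF
    pairs = zip(struct.iter_unpack("!H", old_bytes), struct.iter_unpack("!H", new_bytes))
    for (old,), (new,) in pairs:
        total += (~old & 0xFFFF) + new
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# Minimal SYN segment (checksum field zero): ports 9200 -> 80.
segment = struct.pack("!HHIIBBHHH", 9200, 80, 0, 0, 5 << 4, 0x02, 65535, 0, 0)
before = tcp_checksum("10.0.0.126", "212.110.167.157", segment)
after_full = tcp_checksum("63.204.134.177", "212.110.167.157", segment)
after_incremental = update_checksum(
    before, socket.inet_aton("10.0.0.126"), socket.inet_aton("63.204.134.177"))
print(before != after_full, after_full == after_incremental)
```

The incremental form is attractive for a NAT because it touches only the
changed words rather than the whole segment.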

NAT poses problems for some application protocols, especially those that send
IP addressing information inside the application-layer payload. The File
Transfer Protocol (FTP) [RFC0959] and SIP [RFC5411] are among the best-known
protocols of this type. They require a special application-layer gateway
function that rewrites the application content in order to work unmodified with
NAT or other NAT traversal methods that allow the applications to determine how
to work with the specific NAT they are using. A more complete list of
considerations regarding NAT appears in [RFC3027]. Despite their numerous
problems, NATs are very widely used, and most network routers (including
essentially all low-end home routers) support NAT. Today, NATs are so prevalent
that application designers are encouraged to make their applications
"NAT-friendly" [RFC3235]. It is worth mentioning that despite its shortcomings,
NAT supports the basic protocols (e.g., e-mail, Web browsing) that are needed by
millions of client systems accessing the Internet every day.

A NAT works by rewriting the identifying information in packets transiting
through a router. Most commonly this happens for two directions of a data
transfer. In its most basic form, NAT involves rewriting the source IP address
of packets as they are forwarded in one direction and the destination IP
addresses of packets traveling in the reverse direction (see Figure 7-3). This
allows the source IP address in outgoing packets to become one of the NAT
router's Internet-facing interfaces instead of the originating host's. Thus, to
a host on the Internet, packets coming from any of the hosts on the privately
addressed side of the NAT appear to be coming from a globally routable IP
address of the NAT router.



[Figure 7-3 labels: Private IP Addressing Realm (Home/Enterprise); Public IP Addressing Realm (Internet)]

Figure 7-3 A NAT isolates private addresses and the systems using them from the Internet. Packets
with private addresses are not routed by the Internet directly but instead must be translated
as they enter and leave the private network through the NAT router. Internet hosts
see traffic as coming from a public IP address of the NAT.






Most NATs perform both translation and packet filtering, and the packet-filtering 
criteria depend on the dynamics of the NAT state. The choice of packet-filtering 
policy may have a different granularity—for example, the treatment of unsolic¬ 
ited packets (those not associated with packets originating from behind the NAT) 
received by the NAT may depend on source and destination IP address and/or 
source and destination port number. The behavior may vary between NATs or in 
some cases vary over time through the same NAT. This presents challenges for 
applications that must operate behind a wide variety of NATs. 

7.3.1 Traditional NAT: Basic NAT and NAPT 

The precise behavior of a NAT remained unspecified for many years. Nonetheless, 
a taxonomy of NAT types has emerged, based largely on observing how different 
implementations of the NAT idea behave. The so-called traditional NAT includes 
both basic NAT and Network Address Port Translation (NAPT) [RFC3022]. Basic NAT 
performs rewriting of IP addresses only. In essence, a private address is rewritten 
to be a public address, often from a pool or range of public addresses supplied 
by an ISP. This type of NAT is not the most popular because it does not help to
dramatically reduce the need for IP addresses—the number of globally routable 
addresses must equal or exceed the number of internal hosts that wish to access 
the Internet simultaneously. A much more popular approach, NAPT, involves
using the transport-layer identifiers (i.e., ports for TCP and UDP, query identifiers 
for ICMP) to differentiate which host on the private side of the NAT is associated 
with a particular packet (see Figure 7-4). This allows a large number of internal 
hosts (i.e., multiple thousands) to access the Internet simultaneously using a lim¬ 
ited number of public addresses, often only a single one. We shall ordinarily use 
the term NAT to include both basic NAT and NAPT unless the distinction is
important in a particular context. 
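The way NAPT shares one public address among several hosts can be sketched as
a pair of lookup tables keyed by port. The class below is a simplified
illustration rather than the behavior of any particular product: the class
name, the public address 203.0.113.1, and the choice to hand out replacement
ports starting at 3000 are assumptions. It reproduces the collision case
illustrated in Figure 7-4, in which two internal hosts pick the same source
port.

```python
class Napt:
    """Minimal NAPT mapping table for a single public address."""

    def __init__(self, public_ip: str, first_free_port: int = 3000):
        self.public_ip = public_ip
        self.next_port = first_free_port
        self.out = {}    # (private_ip, private_port) -> public_port
        self.back = {}   # public_port -> (private_ip, private_port)

    def translate_out(self, private_ip: str, private_port: int):
        """Map an outgoing endpoint to (public_ip, public_port)."""
        key = (private_ip, private_port)
        if key not in self.out:
            port = private_port            # keep the original port if free
            while port in self.back:       # collision: pick a replacement
                port = self.next_port      # (no upper bound check in this sketch)
                self.next_port += 1
            self.out[key] = port
            self.back[port] = key
        return (self.public_ip, self.out[key])

    def translate_in(self, public_port: int):
        """Map a returning packet's destination port back to the internal endpoint."""
        return self.back.get(public_port)  # None for unknown (unsolicited) traffic

napt = Napt("203.0.113.1")
print(napt.translate_out("192.168.1.35", 23479))
print(napt.translate_out("192.168.1.2", 23479))
print(napt.translate_in(3000))
```

Returning None for an unknown port is what gives a NAPT its firewall-like
default: traffic that matches no mapping has nowhere to go.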



Figure 7-4 A basic IPv4 NAT (left) rewrites IP addresses from a pool of addresses and leaves port numbers
unchanged. NAPT (right), also known as IP masquerading, usually rewrites addresses to a single
address. NAPT must sometimes rewrite port numbers in order to avoid collisions. In this case, the
second instance of port number 23479 was rewritten to use port number 3000 so that returning
traffic for 192.168.1.2 could be distinguished from the traffic returning to 192.168.1.35.

The addresses used in a private addressing realm "behind" or "inside" a NAT
are not enforced by anyone other than the local network administrator. Thus, it
is possible for a private realm to make use of global address space. In
principle, this is acceptable. However, when such global addresses are owned and
being used by another entity on the Internet, local systems in the private realm
would most likely be unable to reach the public systems using the same addresses
because the close proximity of the local systems would effectively "mask" the
visibility of the farther-away systems using the same addresses. To avoid this
undesirable situation, there are three IPv4 address ranges reserved for use with
private addressing realms [RFC1918]: 10.0.0.0/8, 172.16.0.0/12, and
192.168.0.0/16. These address ranges are often used as default values for
address pools in embedded DHCP servers (see Chapter 6).
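A quick way to check whether an address falls in one of the three [RFC1918]
ranges is shown below, using Python's standard ipaddress module. The helper
name is_rfc1918 is hypothetical.

```python
import ipaddress

# The three private IPv4 ranges listed in [RFC1918].
PRIVATE_V4 = [ipaddress.ip_network(prefix)
              for prefix in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_rfc1918(addr: str) -> bool:
    """Return True if addr lies inside one of the RFC 1918 ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PRIVATE_V4)

for candidate in ("10.0.0.126", "172.31.255.255", "172.32.0.1", "212.110.167.157"):
    print(candidate, is_rfc1918(candidate))
```

The /12 prefix is the one most often misjudged by eye: it covers 172.16.0.0
through 172.31.255.255, so 172.32.0.1 is outside it.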

As suggested earlier, a NAT provides some degree of security similar to that
of a firewall. By default, all systems on the private side of the NAT cannot be
reached from the Internet. In most NAT deployments, the internal systems use
private addresses. Consequently, communications between hosts in the private
addressing realm and those in the public realm can be facilitated only with
participation from the NAT, according to its usage policies and behavior. While
a large variety of policies may be used in practice, a common policy allows
almost all outgoing and returning traffic (associated with outgoing traffic) to
pass through the NAT but blocks almost all incoming new connection requests.
This behavior inhibits "probing" attacks that attempt to ascertain which IP
addresses have active hosts available to exploit. In addition, a NAT (especially
a NAPT) "hides" the number and configuration of internal addresses from the
outside. Some users feel this topology information is proprietary and should
remain confidential. NAT helps by providing so-called topology hiding.

As we shall now explore, NATs are tailored to the protocols and applications
that they need to support, so it is difficult to discuss NAT behavior without
also mentioning the particular protocol(s) it is being asked to handle. Thus, we
now turn to how NAT behaves with each major transport protocol and how it may be
used in mixed IPv4/IPv6 environments. Many of the behavioral specifics for NATs
have been the subject of the IETF Behavior Engineering for Hindrance Avoidance
(BEHAVE) working group. BEHAVE has produced a number of documents, starting in
2007, that clarify consistent behaviors for NATs. These documents are useful for
application writers and NAT developers so that a consistent expectation can be
established as to how NATs should operate.

7.3.1.1 NAT and TCP 

Recall from Chapter 1 that the primary transport-layer protocol for the
Internet, TCP, uses an IP address and port number to identify each end of a
connection. A connection is identified by the combination of two ends; each
unique TCP connection is identified by two IP addresses and two port numbers.
When a TCP connection starts, an "active opener" or client usually sends a
synchronization (SYN) packet to a "passive opener" or server. The server
responds with its own SYN packet, which also includes an acknowledgment (ACK)
of the client's SYN. The client then responds with an ACK to the server. This
"three-way handshake" establishes the connection. A similar exchange with
finish (FIN) packets is used to gracefully close a connection. The connection
can also be forcefully closed right away using a reset (RST) packet. (See
Chapter 13 for more detail on TCP connections.) The behavioral requirements for
traditional NAT with TCP are defined in [RFC5382] and relate primarily to the
TCP three-way handshake.

Referring to the example home network in Figure 7-3, consider a TCP connection
initiated by the wireless client at 10.0.0.126 destined for the Web server on
the host www.isoc.org (IPv4 address 212.110.167.157). Using the notation
(source IP:source port; destination IP:destination port) to indicate IPv4
addresses and port numbers, the packet initiating the connection on the private
segment might be addressed as (10.0.0.126:9200; 212.110.167.157:80). The
NAT/firewall device, acting as the default router for the client, receives the
first packet. The NAT notices that the incoming packet is a new connection
(because the SYN bit in the TCP header is turned on; see Chapter 13). If policy
permits (which it typically does because this is an outgoing connection), the
source IP address is modified in the packet to reflect the routable IP address
of the NAT router's external interface. Thus, when the NAT forwards this packet,
the addressing is (63.204.134.177:9200; 212.110.167.157:80). In addition to
forwarding the packet, the NAT creates internal state to remember the fact that
a new connection is being handled by the NAT (called a NAT session). At a
minimum, this state includes an entry (called a NAT mapping) containing the
source port number and IP address of the client. This becomes useful when the
Internet server replies. The server replies to the endpoint
(63.204.134.177:9200), the external NAT address, using the port number chosen
initially by the client. This behavior is called port preservation. By matching
the destination port number on the received datagram against the appropriate NAT
mapping, the NAT is able to ascertain the internal IP address of the client that
made the initial request. In our example, this address is 10.0.0.126, so the NAT
rewrites the response packet from (212.110.167.157:80; 63.204.134.177:9200) to
(212.110.167.157:80; 10.0.0.126:9200) and forwards it. The client then receives
a response to its request and for most purposes is now connected to the server.

This example conveys how a basic NAT session is established in the normal
case, but not how the session is cleared. Session state is removed if FINs are
exchanged, but not all TCP connections are cleared gracefully. Sometimes a
computer is simply turned off, which can leave stale NAT mappings in the NAT's
memory. Thus, a NAT must also remove mappings thought to have "gone dead"
because of a lack of traffic (or if an RST segment indicates some other form of
problem).

Most NATs include a simplified version of the TCP connection establishment
procedures and can distinguish between connection success and failure. In
particular, when an outgoing SYN segment is observed, a connection timer is
activated, and if no ACK is seen before the timer expires, the session state is
cleared. If an ACK does arrive, the timer is canceled and a session timer is
created, with a considerably longer timeout (e.g., hours instead of minutes).
When this session timer expires, the NAT may send an additional packet to the
internal endpoint, just to double-check if the session is indeed dead (called
probing). If it receives an ACK, the NAT realizes that the connection is still
active, resets the session timer, and does not delete the session. If it
receives either no response (after a close timer has expired) or an RST segment,
the connection has gone dead, and the state is cleared.

[RFC5382], a product of the BEHAVE working group, notes that a TCP
connection can be configured to send "keepalive" packets (see Chapter 17), and the
default rate is one packet every 2 hours, if enabled. Otherwise, a TCP connection
can remain established indefinitely. While a connection is being set up or cleared,
however, the maximum idle time is 4 minutes. Consequently [RFC5382] requires
(REQ-5) that a NAT wait at least 2 hours and 4 minutes before concluding that
an established connection is dead and at least 4 minutes before concluding that a
partially opened or closed connection is dead.
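
The two idle limits can be expressed as a small check. The following is a minimal sketch, assuming a NAT that tracks each TCP session as "opening", "established", or "closing"; the state names and the function may_expire are hypothetical and chosen only for illustration.

```python
# Minimum idle times from REQ-5 as described above, in seconds.
ESTABLISHED_IDLE_MIN_S = 2 * 60 * 60 + 4 * 60  # 2 hours 4 minutes
TRANSITORY_IDLE_MIN_S = 4 * 60                 # 4 minutes


def may_expire(state: str, idle_seconds: float) -> bool:
    """Return True if the NAT is permitted to discard the idle session."""
    if state == "established":
        return idle_seconds >= ESTABLISHED_IDLE_MIN_S
    if state in ("opening", "closing"):  # partially opened or closed
        return idle_seconds >= TRANSITORY_IDLE_MIN_S
    raise ValueError(f"unknown session state: {state}")
```

An established connection that sends a keepalive every 2 hours therefore always refreshes its session before the NAT is allowed to give up on it.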

One of the tricky problems for a TCP NAT is handling peer-to-peer
applications operating on hosts residing on the private sides of multiple NATs [RFC5128].
Some of these applications use a simultaneous open whereby each end of the
connection acts as a client and sends SYN packets more or less simultaneously. TCP is
able to handle this case by responding with SYN + ACK packets that complete the
connection faster than with the three-way handshake, but many existing NATs do
not handle it properly. [RFC5382] addresses this by requiring (REQ-2) that a NAT
handle all valid TCP packet exchanges, and simultaneous opens in particular.
Some peer-to-peer applications (e.g., network games) use this behavior. In
addition, [RFC5382] specifies that an inbound SYN for a connection about which the
NAT knows nothing should be silently discarded. This can occur when a
simultaneous open is attempted but the external host's SYN arrives at the NAT before the
internal host's SYN. Although this may seem unlikely, it can happen as a result
of clock skew, for example. If the incoming external SYN is dropped, the internal
SYN has time to establish a NAT mapping for the same connection represented by
the external SYN. If no internal SYN is forthcoming in 6s, the NAT may signal an
error to the external host.

7.3.1.2 NAT and UDP

The NAT behavioral requirements for unicast UDP are defined in [RFC4787].
Most of the same issues arise when performing NAT on a collection of UDP
datagrams as arise when performing NAT on TCP. UDP is somewhat different,
however, because there are no connection establishment and clearing procedures as
there are in TCP. More specifically, there are no indicators such as the SYN, FIN,
and RST bits to indicate that a session is being created or destroyed. Furthermore,
the participants in an association may not be completely clear. UDP does not use
a 4-tuple to identify a connection like TCP; instead, it can rely on only the two
endpoint address/port number combinations. To handle these issues, UDP NATs
use a mapping timer to clear NAT state if a binding has not been used "recently."
There is considerable variation in the values used for this timer to determine what
"recently" means, but [RFC4787] requires the timer to be at least 2 minutes and
recommends that it be 5 minutes. A related consideration is when the timer should be
considered refreshed. Timers can be refreshed when packets travel from the inside
to the outside of the NAT (NAT outbound refresh behavior) or vice versa (NAT
inbound refresh behavior). [RFC4787] requires NAT outbound refresh behavior to
be true. Inbound behavior may or may not be true.
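
As a rough illustration of the mapping timer and the two refresh behaviors, consider the following sketch. The class UdpMapping and its fields are hypothetical; the default of 300 seconds reflects the recommended 5 minutes, and time is passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class UdpMapping:
    """One UDP binding with a mapping timer (illustrative sketch)."""

    timeout_s: float = 300.0       # recommended 5 minutes; must be >= 120
    inbound_refresh: bool = False  # inbound refresh behavior is optional
    last_refresh: float = 0.0      # time of the most recent refresh

    def on_packet(self, now: float, outbound: bool) -> None:
        # Outbound packets always refresh the timer (required behavior);
        # inbound packets refresh it only if the NAT is configured to do so.
        if outbound or self.inbound_refresh:
            self.last_refresh = now

    def expired(self, now: float) -> bool:
        return now - self.last_refresh >= self.timeout_s
```

With inbound refresh disabled, a binding that only receives traffic from the outside still expires once the timeout has elapsed since the last outbound packet.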

As we discussed in Chapter 5 (and will see again in Chapter 10), UDP and IP
packets can be fragmented. Fragmentation allows for a single IP packet to span
multiple chunks (fragments), each of which is treated as an independent
datagram. However, because of the layering of UDP above IP, an IP fragment other
than the first one does not contain the port number information needed by NAPT
to operate properly. This also applies to TCP and ICMP. Thus, in general,
fragments cannot be handled properly by simple NATs or NAPTs.

7.3.1.3 NAT and Other Transport Protocols (DCCP, SCTP) 

Although TCP and UDP are by far the most widely used Internet transport
protocols, there are two other protocols for which NAT behaviors have been defined
or are being defined. The Datagram Congestion Control Protocol (DCCP) [RFC4340]
provides a congestion-controlled datagram service. [RFC5597] gives NAT
behavioral requirements with respect to DCCP, and [RFC5596] gives a modification to
DCCP to support a TCP-like simultaneous open procedure for use with DCCP. The
Stream Control Transmission Protocol (SCTP) [RFC4960] provides a reliable
messaging service that can accommodate hosts with multiple addresses. Considerations
for NAT with SCTP are given in [HBA09] and [IDSNAT].

7.3.1.4 NAT and ICMP 

ICMP, the Internet Control Message Protocol, is detailed in Chapter 8. It provides
status information about IP packets and can also be used for making certain
measurements and gathering information about the state of the network. The NAT
behavioral requirements for ICMP are defined in [RFC5508]. There are two issues
involved when NAT is used for ICMP. ICMP has two categories of messages:
informational and error. Error messages generally contain a (partial or full) copy of the
IP packet that induced the error condition. They are sent from the point where
an error was detected, often in the middle of the network, to the original sender.
Ordinarily, this presents no difficulty, but when an ICMP error message passes
through a NAT, the IP addresses in the included "offending datagram" need to
be rewritten by the NAT in order for them to make sense to the end client (called
ICMP fix-up). For informational messages, the same issues arise, but in this case
most message types are of a query/response or client/server nature and include
a Query ID field that is handled much like port numbers for TCP or UDP. Thus,
a NAT handling these types of messages can recognize outgoing informational
requests and set a timer in anticipation of a returning response.

7.3.1.5 NAT and Tunneled Packets

In some cases, tunneled packets (see Chapter 3) are to be sent through a NAT. 
When this happens, not only must a NAT rewrite the IP header, but it may also 
have to rewrite the headers or payloads of other packets that are encapsulated in 
them. One example of this is the Generic Routing Encapsulation (GRE) header
used with the Point-to-Point Tunneling Protocol (PPTP; see Chapter 3). When
the GRE header is passed through a NAT, its Call-ID field could conflict with the
NAT's (or with other hosts' tunneled connections). If the NAT fails to handle this
mapping appropriately, communication is not possible. As we might imagine,
additional levels of encapsulation serve only to complicate a NAT's job further.

7.3.1.6 NAT and Multicast 

So far we have discussed only unicast IP traffic with NATs. NATs can be
configured to support multicast traffic (see Chapter 9), although this is rare. [RFC5135]
gives the requirements for handling multicast traffic through NATs. In effect, to
support multicast traffic a NAT is augmented with an IGMP proxy (see [RFC4605]
and Chapter 9). In addition, the destination IP addresses and port numbers of
packets traveling from a host on the outside to the inside of the NAT are not modified.
For traffic flowing from inside to outside, the source addresses and port numbers
may be modified according to the same behaviors as with unicast UDP.

7.3.1.7 NAT and IPv6

Given the tremendous popularity of NAT for IPv4, it is natural to wonder whether
NAT will be used with IPv6. At present, this is a contentious issue [RFC5902].
To many protocol designers, NAT arose as a necessary but undesirable "wart"
that has added a tremendous amount of complexity to the design of every other
protocol. Consequently, there is staunch resistance to supporting the use of NAT
with IPv6 based on the idea that saving address space is unnecessary with IPv6
and that other desirable NAT features (e.g., firewall-like functionality, topology
hiding, and privacy) can be better achieved using Local Network Protection (LNP)
[RFC4864]. LNP represents a collection of techniques with IPv6 that match or
exceed the properties of NATs.

Aside from its packet-filtering properties, NAT supports the coexistence of
multiple address realms and thereby helps to avoid the problem of a site having
to change its IP addresses when it switches ISPs. For example, [RFC4193] defines
Unique Local IPv6 Unicast Addresses (ULAs) that could conceivably be used with an
experimental version of IPv6-to-IPv6 prefix translation called NPTv6 [RFC6296]. It
uses an algorithm instead of a table to translate IPv6 addresses to (different) IPv6
addresses (e.g., in different realms) based on their prefix and as a result does not
require keeping per-connection state as with conventional NAT. In addition, the
algorithm modifies addresses in such a way that the resulting checksum
computation for common transport protocols (TCP, UDP) remains the same. This
significantly reduces the complexity of NAT because it does not have to modify the
data in a packet beyond the network layer and does not require access to
transport layer port numbers in order to operate properly. However, applications that
require access to a NAT's external address must still use a NAT traversal method
or depend on an ALG. In addition, NPTv6 does not by itself offer the
packet-filtering capabilities of a firewall, so additional deployment considerations must be
made.
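
The idea of a stateless, checksum-neutral prefix translation can be illustrated with a simplified sketch. It assumes /48 prefixes on both sides and places the compensating adjustment in the fourth 16-bit word of the address; the function names are hypothetical, the prefixes are example values, and the special cases handled by the full [RFC6296] algorithm are ignored.

```python
import ipaddress
from typing import List


def words(addr: str) -> List[int]:
    """Split an IPv6 address into eight 16-bit words."""
    value = int(ipaddress.IPv6Address(addr))
    return [(value >> (16 * (7 - i))) & 0xFFFF for i in range(8)]


def ones_complement_sum(addr: str) -> int:
    # A 16-bit one's-complement sum is the ordinary sum taken modulo 0xFFFF.
    return sum(words(addr)) % 0xFFFF


def translate_prefix48(addr: str, new_prefix: str) -> str:
    """Replace the /48 prefix and adjust word 3 so the sum is unchanged."""
    old = words(addr)
    new = words(new_prefix)[:3] + old[3:]
    delta = sum(old[:3]) - sum(new[:3])  # change introduced by the new prefix
    new[3] = (old[3] + delta) % 0xFFFF   # compensate in the subnet word
    value = 0
    for w in new:
        value = (value << 16) | w
    return str(ipaddress.IPv6Address(value))


inside = "fd01:203:405:1::1234"  # a ULA-style internal address (example)
outside = translate_prefix48(inside, "2001:db8:1::")
```

Because the sum of the address words is the same before and after translation, a transport checksum that covers the address through the pseudo-header remains valid without the translator touching the TCP or UDP header.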

7.3.2 Address and Port Translation Behavior 

There has been considerable variation in the way NATs operate. Most of the details
relate to the specifics of the address and port mappings. One of the primary goals
of the BEHAVE working group in IETF was to clarify the common behaviors and
set guidelines as to which are the most appropriate. To better understand the
issues involved, we begin with a generic NAT mapping example (see Figure 7-5).



Figure 7-5 A NAT's address and port behavior is characterized by what its mappings depend on. 

The inside host uses IP address:port X:x to contact Y1:y1 and then Y2:y2. The address and
port used by the NAT for these associations are X1':x1' and X2':x2', respectively. If X1':x1'
equals X2':x2' for any Y1:y1 or Y2:y2, the NAT has endpoint-independent mappings. If
X1':x1' equals X2':x2' if and only if Y1 equals Y2, the NAT has address-dependent
mappings. If X1':x1' equals X2':x2' if and only if Y1:y1 equals Y2:y2, the NAT has address-
and port-dependent mappings. A NAT with multiple external addresses (i.e., where X1'
may not equal X2') has an address pooling behavior of arbitrary if the outside address is
chosen without regard to inside or outside address. Alternatively, it may have a pooling
behavior of paired, in which case the same X1 is used for any association with Y1.


In Figure 7-5, we use the notation X:x to indicate that a host in the private
addressing realm (inside host) uses IP address X with port number x (for ICMP,
the query ID is used instead of the port number). The IP address X is ordinarily
chosen from the private IPv4 address space defined in [RFC1918]. To reach the
remote address/port combination Y:y, the NAT establishes a mapping using an
external (usually a public, globally routable) address X1' and port number x1'.
Assuming that the internal host contacts Y1:y1 followed by Y2:y2, the NAT
establishes mappings X1':x1' and X2':x2', respectively. In most cases, X1' equals X2'
because most sites use only a single globally routable IP address. The mapping is
said to be reused if x1' equals x2'. If x1' and x2' equal x, the NAT implements port
preservation, mentioned earlier. In some cases, port preservation is not possible,
so the NAT must deal with port collisions as suggested by Figure 7-4.

Table 7-1 and Figure 7-5 summarize the various NAT port and address
behaviors based on definitions from [RFC4787]. Table 7-1 also gives filtering behaviors
that use similar terminology and that we discuss in Section 7.3.3. For all common
transports, including TCP and UDP, the required NAT address- and port-handling
behavior is endpoint-independent (a similar behavior is recommended for ICMP).
The purpose of this requirement is to help applications that attempt to determine
the external addresses used for their traffic to work more reliably. We discuss this
in more detail in Section 7.4 when we discuss NAT traversal.
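
One way to picture the three mapping behaviors is as three choices of lookup key for the NAT's mapping table. The sketch below makes that concrete; the class MappingTable, the behavior strings, and the port numbering starting at 40000 are all assumptions made for illustration.

```python
from itertools import count
from typing import Dict, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)

BEHAVIORS = ("endpoint-independent", "address-dependent",
             "address-and-port-dependent")


class MappingTable:
    """Allocates external X':x' endpoints according to a mapping behavior."""

    def __init__(self, behavior: str, external_ip: str = "192.0.2.1") -> None:
        if behavior not in BEHAVIORS:
            raise ValueError(behavior)
        self.behavior = behavior
        self.external_ip = external_ip
        self._ports = count(40000)  # hypothetical external port allocator
        self._table: Dict[tuple, Endpoint] = {}

    def _key(self, inside: Endpoint, remote: Endpoint) -> tuple:
        if self.behavior == "endpoint-independent":
            return (inside,)             # mapping ignores the remote endpoint
        if self.behavior == "address-dependent":
            return (inside, remote[0])   # depends on remote address Y only
        return (inside, remote)          # depends on remote Y:y

    def map(self, inside: Endpoint, remote: Endpoint) -> Endpoint:
        key = self._key(inside, remote)
        if key not in self._table:
            self._table[key] = (self.external_ip, next(self._ports))
        return self._table[key]
```

With an endpoint-independent table, an inside host X:x receives the same external endpoint no matter which Y:y it contacts, which is what allows an application to learn its external address from one server and reuse it with another peer.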


Table 7-1 A NAT's overall behavior is defined by both its translation and filtering behaviors. Each of these 
may be independent of host address, dependent on address, or dependent on both address and port 
number. 


Behavior Name         Translation Behavior       Filtering Behavior

Endpoint-             X1':x1' = X2':x2' for      Allows any packets for X1:x1 as long as any
independent           all Y2:y2 (required)       X1':x1' exists (recommended for greatest
                                                 transparency)

Address-              X1':x1' = X2':x2' iff      Allows packets for X1:x1 from Y1:y1 as long
dependent             Y1 = Y2                    as X1 has previously contacted Y1
                                                 (recommended for more stringent filtering)

Address- and          X1':x1' = X2':x2' iff      Allows packets for X1:x1 from Y1:y1 as long
port-dependent        Y1:y1 = Y2:y2              as X1 has previously contacted Y1:y1


As stated previously, a NAT may have several external addresses available to 
use. The set of addresses is typically called the NAT pool or NAT address pool. Most 
moderate to large-scale NATs use address pools. Note that NAT address pools are 
distinct from the DHCP address pools discussed in Chapter 6, although a single 
device may need to handle both NAT and DHCP address pools. One question
in such environments is whether, when a single host behind the NAT opens multiple
simultaneous connections, each is assigned the same external IP address (called
address pairing). A NAT's IP address pooling behavior is said to be arbitrary
if there is no restriction on which external address is used for any association. It
is said to be paired if it implements address pairing. Pairing is the recommended
NAT behavior for all transports. If pairing is not used, the communication peer
of an internal host may erroneously conclude that it is communicating with
different hosts. For NATs with only a single external address, this is obviously not a
problem.

A very brittle type of NAT overloads not only addresses but also ports (called
port overloading). In this case, the traffic of multiple internal hosts may be
rewritten to the same external IP address and port number. This is a dangerous prospect
because if multiple hosts associate with a service on the same external host, it is
no longer possible to determine the appropriate destination for traffic returning
from the external host. For TCP, this is a consequence of all four elements of the
connection identifier (source and destination address and port numbers) being
identical in the external network among the various connections. Such behavior
is now disallowed.

Some NATs implement a special feature called port parity. Such NATs attempt
to preserve the "parity" (evenness or oddness) of port numbers. Thus, if x1 is even,
x1' is even and vice versa. Although not as strong as port preservation, such
behavior is sometimes useful for specific application protocols that use special port
numberings (e.g., the Real-Time Protocol, abbreviated RTP, has traditionally used
multiple ports, but there are proposed methods for avoiding this issue [RFC5761]).
Port parity preservation is a recommended NAT feature but not a requirement. It 
is also expected to become less important over time as more sophisticated NAT 
traversal methods become widespread. 
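
A port allocator that tries port preservation first and falls back to parity preservation might look like the following sketch. The function name, the usable port range, and the simple linear search are assumptions made for illustration only.

```python
from typing import Set


def pick_external_port(internal_port: int, in_use: Set[int],
                       low: int = 1024, high: int = 65535) -> int:
    """Choose x1' for internal port x1, keeping parity when x1 is taken."""
    # First choice: port preservation (x1' equals x1).
    if low <= internal_port <= high and internal_port not in in_use:
        return internal_port
    # Otherwise consider only ports with the same parity as x1.
    start = low + ((internal_port - low) % 2)
    for candidate in range(start, high + 1, 2):
        if candidate not in in_use:
            return candidate
    raise RuntimeError("no free external port with matching parity")
```

For example, if internal port 5004 is already in use on the external side, the function returns another even port rather than an arbitrary one.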

7.3.3 Filtering Behavior 

When a NAT creates a binding for a TCP connection, UDP association, or
various forms of ICMP traffic, not only does it establish the address and port
mappings, but it must also determine its filtering behavior for the returning traffic if
it acts as a firewall, which is the common case. The type of filtering a NAT
performs, although logically distinct from its address- and port-handling behavior, is
often related. In particular, the same terminology is used: endpoint-independent,
address-dependent, and address- and port-dependent.

A NAT's filtering behavior is usually related to whether it has established an
address mapping. Clearly, a NAT lacking any form of address mapping is unable
to forward any traffic it receives from the outside to the inside because it would
not know which internal destination to use. For the most common case of
outgoing traffic, when a binding is established, filtering is disabled for relevant return
traffic. For NATs with endpoint-independent behavior, as soon as any mapping
is established for an internal host, any incoming traffic is permitted, regardless
of source. For address-dependent filtering behavior, traffic destined for X1:x1
is permitted from Y1:y1 only if Y1 had been previously contacted by X1:x1. For
those NATs with address- and port-dependent filtering behavior, traffic destined
for X1:x1 is permitted from Y1:y1 only if Y1:y1 had been previously contacted
by X1:x1. The difference between the last two is that the last form takes the port
number y1 into account.
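
The three filtering rules can be written as a single predicate over the set of remote endpoints an inside host has already contacted. This is a minimal sketch; the class NatFilter and the behavior strings are hypothetical names, and the addresses in the usage notes are documentation examples.

```python
from typing import Set, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


class NatFilter:
    """Decides whether inbound traffic to an inside endpoint is permitted."""

    def __init__(self, behavior: str) -> None:
        if behavior not in ("endpoint-independent", "address-dependent",
                            "address-and-port-dependent"):
            raise ValueError(behavior)
        self.behavior = behavior
        # (inside endpoint, remote endpoint) pairs seen on outgoing traffic
        self.contacted: Set[Tuple[Endpoint, Endpoint]] = set()

    def note_outbound(self, inside: Endpoint, remote: Endpoint) -> None:
        self.contacted.add((inside, remote))

    def allows(self, remote: Endpoint, inside: Endpoint) -> bool:
        peers = [r for (i, r) in self.contacted if i == inside]
        if self.behavior == "endpoint-independent":
            return bool(peers)                           # any mapping suffices
        if self.behavior == "address-dependent":
            return any(r[0] == remote[0] for r in peers) # same remote address
        return remote in peers                           # same address and port
```

After X1:x1 contacts Y1:y1, an address-dependent filter admits traffic from any port on Y1, whereas an address- and port-dependent filter admits it only from y1.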

7.3.4 Servers behind NATs

One of the most obvious problems with NATs is that a system wishing to provide
a service from behind a NAT is not directly reachable from the outside. Consider
the example in Figure 7-3 once again. If the host with address 10.0.0.3 is to
provide a service to the Internet, it cannot be reached without participation from the
NAT, for at least two reasons. First, the NAT is acting as the Internet router, so it
must agree to forward the incoming traffic destined for 10.0.0.3. Second, and more
important, the IP address 10.0.0.3 is not routable through the Internet and cannot
be used to identify the server by hosts in the Internet. Instead, the external address
of the NAT must be used to find the server, and the NAT must arrange to properly
rewrite and forward the appropriate traffic to the server so that it can operate. This
process is most often called port forwarding or port mapping.

With port forwarding, incoming traffic to a NAT is forwarded to a specific
configured destination behind the NAT. By employing NAT with port
forwarding, it is possible to allow servers to provide services to the Internet even though
they may be assigned private, nonroutable addresses. Port forwarding typically
requires static configuration of the NAT with the address of the server and the
associated port number whose traffic should be forwarded. The port forwarding
directive acts like an always-present static NAT mapping. If the server's IP address
is changed, the NAT must be updated with the new addressing information. Port
forwarding also has the limitation that it has only one set of port numbers for each
of its (IP address, transport protocol) combinations. Thus, if the NAT has only a
single external IP address, it can forward only a single port of the same transport
protocol to at most one internal machine (e.g., it could not support two
independent Web servers on the inside being remotely accessible using TCP port 80 from
the outside).
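
The one-rule-per-port limitation follows directly from how a forwarding table is keyed. The sketch below uses a hypothetical PortForwarder class keyed by (external address, protocol, external port); the addresses are examples only.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]             # (IP address, port)
RuleKey = Tuple[str, str, int]         # (external IP, protocol, external port)


class PortForwarder:
    """Static port forwarding rules, acting like always-present mappings."""

    def __init__(self) -> None:
        self.rules: Dict[RuleKey, Endpoint] = {}

    def add_rule(self, external_ip: str, proto: str, external_port: int,
                 target: Endpoint) -> None:
        key = (external_ip, proto, external_port)
        if key in self.rules:
            # Only one inside server can own a given external port.
            raise ValueError(f"{proto} port {external_port} is already forwarded")
        self.rules[key] = target

    def lookup(self, external_ip: str, proto: str,
               external_port: int) -> Optional[Endpoint]:
        return self.rules.get((external_ip, proto, external_port))
```

A second inside Web server can still be published, but only on a different external port (for instance 8080), which outside clients must then know to use.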

7.3.5 Hairpinning and NAT Loopback 

An interesting issue arises when a client wishes to reach a server and both reside on 
the same, private side of the same NAT. NATs that support this scenario implement 
so-called hairpinning or NAT loopback. Referring to Figure 7-6, assume that host X1
attempts to establish a connection to host X2. If X1 knows the private
addressing information, X2:x2, there is no problem because the connection can be made
directly. However, in some cases X1 knows only the public address information,
X2':x2'. In these cases, X1 attempts to contact X2 using the NAT with destination
X2':x2'. The hairpinning process takes place when the NAT notices the existence of
the mapping between X2':x2' and X2:x2 and forwards the packet to X2:x2 residing
on the private side of the NAT. At this point a question arises as to which source
address is contained in the packet heading to X2:x2: X1:x1 or X1':x1'?

If the NAT presents the hairpinned packet to X2 with source addressing
information X1':x1', the NAT is said to have "external source IP address and port"
hairpinning behavior. This behavior is required for TCP NAT [RFC5382]. The
justification for requiring this behavior is for applications that identify their peers
using globally routable addresses. In our example, X2 may be expecting an
incoming connection from X1' (e.g., because of coordination from a third-party system).

Figure 7-6 A NAT that implements hairpinning or NAT loopback allows a client to reach a server on
the same side of the NAT using the server's external IP address and port numbers. That
is, X1 can reach X2:x2 using the addressing information X2':x2'.
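
The hairpinning decision, including the choice of source address, can be sketched as follows. The function hairpin and the mapping dictionary (inside endpoint to external endpoint) are hypothetical, and the addresses in the usage notes are examples.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


def hairpin(src: Endpoint, dst: Endpoint,
            mappings: Dict[Endpoint, Endpoint],
            external_source: bool = True
            ) -> Optional[Tuple[Endpoint, Endpoint]]:
    """Rewrite an inside-to-inside packet that was addressed to X2':x2'.

    mappings maps each inside endpoint to its external endpoint. Returns the
    (source, destination) delivered to the inside server, or None if the
    destination is not one of the NAT's own external endpoints.
    """
    reverse = {ext: inside for inside, ext in mappings.items()}
    if dst not in reverse:
        return None  # not a hairpin case; normal outbound handling applies
    if src not in mappings:
        raise LookupError("sender has no mapping yet")
    # "External source IP address and port" behavior presents X1':x1'.
    new_src = mappings[src] if external_source else src
    return new_src, reverse[dst]
```

With external_source set to True, X2 sees the connection arriving from X1':x1', which is the address it would have been told to expect by a third-party coordinator.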

7.3.6 NAT Editors 

Together, packets using the UDP and TCP transport protocols account for most
of the IP traffic carried on the Internet. These transport protocols, by themselves,
can be supported by NAT without additional complexity because their formats are
well understood. When application-layer protocols used in conjunction with them
carry transport-layer or lower-layer information such as IP addresses, the NAT
problem becomes considerably more complicated. The most common example is
FTP [RFC0959]. In normal operation, it communicates transport- and network-layer
endpoint information (an IP address and port number) so that additional
connections can be made when bulk data is to be transferred. This requires a NAT to
rewrite not only the IP addresses and port numbers in the IP and TCP portions of a
datagram, but also some of the application payload itself. NATs with this capability
are sometimes called NAT editors. If a NAT changes the size of a packet's
application payload, considerable work may be required. For example, TCP numbers
every byte in the data transfer using sequence numbers (see Chapter 15), so if the
size of a packet is changed, the sequence numbers also require modification. PPTP
[RFC2637] also requires a NAT editor for transparent operation (see Chapter 3).
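
To see why payload editing affects sequence numbers, consider a sketch that rewrites an FTP PORT command, which carries an IPv4 address and port as six comma-separated decimal numbers. The function rewrite_ftp_port is hypothetical and handles only this one command form; the addresses reuse the values from the earlier NAT example.

```python
from typing import Tuple


def rewrite_ftp_port(line: bytes, ext_ip: str, ext_port: int) -> Tuple[bytes, int]:
    """Rewrite 'PORT h1,h2,h3,h4,p1,p2' to carry the external endpoint.

    Returns the new command line and the change in payload length, which is
    the amount by which later TCP sequence numbers on this connection would
    need to be adjusted.
    """
    if not line.upper().startswith(b"PORT "):
        return line, 0  # not a PORT command; leave the payload untouched
    fields = ext_ip.split(".") + [str(ext_port >> 8), str(ext_port & 0xFF)]
    new_line = b"PORT " + ",".join(fields).encode("ascii") + b"\r\n"
    return new_line, len(new_line) - len(line)


original = b"PORT 10,0,0,126,35,240\r\n"  # 10.0.0.126, port 35*256+240 = 9200
edited, delta = rewrite_ftp_port(original, "63.204.134.177", 9200)
```

Here the edited command is 4 bytes longer than the original, so the NAT would have to add 4 to the sequence numbers of subsequent segments from the client and subtract 4 from the corresponding acknowledgment numbers returning from the server.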

7.3.7 Service Provider NAT (SPNAT) and Service Provider IPv6 Transition

A relatively recent development involves the idea of moving NATs from the
customer premises into the ISP. This is sometimes called service provider NAT
(SPNAT), carrier-grade NAT (CGN), or large-scale NAT (LSN) and is intended to
further mitigate the IPv4 address depletion problem. With SPNAT, it is conceivable
that many ISP customers could share a single global IPv4 address. In effect, this
moves the point of aggregation from the edge of the customer to the edge of the
ISP. In its basic form, there is no functional difference between conventional NAT
and SPNAT; the difference is really in the proposed domain of use. However,
moving the NAT function from customer to ISP raises security concerns and brings
into question whether individual end users are able to deploy Internet servers
and control firewall policy [MBCB08]. A study from 2009 found that a significant
number of users accept incoming connections, largely because of peer-to-peer
programs [ANM09].

SPNAT can help with the IPv4 address depletion problem, but IPv6 is viewed
as the ultimate solution. For a number of reasons already discussed, however, IPv6
deployment has lagged expectations. Originally, a scheme known as dual-stack
(see [RFC4213]), whereby each system uses both IPv6 and IPv4 addresses, was
intended to support transition to IPv6, but this approach was anticipated to be
temporary and rendered unnecessary long before the depletion of IPv4 addresses.
An arguably more pragmatic approach is now being undertaken that combines
tunneling, address translation, and dual-stack systems in various configurations.
We'll discuss some of these in Section 7.6 after exploring the methods that have
been developed for dealing with existing NATs.


7.4 NAT Traversal 

As an alternative to the complexity of placing ALGs and NAT editors in NAT
devices, an application may attempt to perform its own NAT traversal. Usually
this involves the application trying to ascertain the external IP address and port
numbers that will be used when its traffic passes through a NAT and modifying
its protocol operations accordingly. If an application is distributed across the
network (e.g., has multiple clients and servers, some of which are not behind NATs),
the servers can be used to shuffle (copy) data between the clients that connect
from behind NATs or enable such clients to discover each other's NAT bindings
and possibly facilitate direct communication. Using a server to copy data between
clients is usually a last-resort option because of the overheads involved and
potential for abuse. Consequently, most approaches attempt to provide for some method
that allows direct communication.

Direct methods have been popular for peer-to-peer file sharing, games, and
communication applications. However, such techniques are often confined to a
particular application, meaning that each new distributed application requiring
NAT traversal tends to implement its own method(s). This can lead to redundancy
and interoperability problems, ultimately increasing users' frustration and cost.
To combat this situation, a standard approach for handling NAT traversal has
been established, and it depends on a collection of several distinct, subordinate
protocols that we discuss in the following sections. For now, we begin with one of
the more robust yet nonstandard approaches used by distributed applications. We
then move on to standardized frameworks for NAT traversal.

7.4.1 Pinholes and Hole Punching

As discussed previously, a NAT typically includes both traffic rewriting and
filtering capabilities. When a NAT mapping is established, traffic for a particular
application is usually permitted to traverse the NAT in both directions. Such
mappings are narrow; they usually apply only to a single application for its duration of
execution. These types of mappings are called pinholes, because they are designed
to permit only a certain temporary traffic flow (e.g., a pair of IP address and port
number combinations). Pinholes are usually established and removed
dynamically as a consequence of communication between programs.

A method that attempts to allow two or more systems, each behind a NAT,
to communicate directly using pinholes is called hole punching. It is described for
UDP in Section 3.3 of [RFC5128] and for TCP in Section 3.4. To punch a hole, a
client contacts a known server using an outgoing connection that establishes a
mapping in its local NAT. When another client contacts the same server, the server
has connections to each of the clients and knows their external addressing
information. It then exchanges the client external addressing information between the
clients. Once this information is known, a client can attempt a direct connection to
the other client(s). The popular Skype peer-to-peer application uses this approach
(and some others).

Referring to Figure 7-7, assume client A contacts server S1 followed by client B.
S1 will have learned A's and B's external addressing information: IPv4 addresses
192.0.2.201 and 203.0.113.100, respectively. By sending B's information to A and vice
versa, A can attempt to contact B directly at its external address (and vice versa).
Whether this will work depends on the type of NATs that have been deployed.
NAT state for the (A,S1) connection lives in N1 and NAT state for (B,S1) lives in
both N2 and N3. If all NATs are endpoint-independent, this is sufficient for direct
connections to be possible. Any other type of NAT will not accept traffic from
other than S1 and will thus prohibit direct communication. Said another way, this
approach fails if both hosts are behind NATs with address-dependent or address-
and port-dependent mapping behavior.
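
The server's role in hole punching amounts to recording the external endpoint it observes for each client and handing each client the other's endpoint. The sketch below models only that bookkeeping, with no actual networking; the class and function names are hypothetical, and the port numbers are invented for illustration (only the IPv4 addresses come from the example).

```python
from typing import Dict, Iterable, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


class RendezvousServer:
    """Bookkeeping performed by a server such as S1 (no real networking)."""

    def __init__(self) -> None:
        self.observed: Dict[str, Endpoint] = {}

    def register(self, client_id: str, observed_source: Endpoint) -> None:
        # Record the source address and port seen on the client's packets,
        # i.e., the client's external (post-NAT) endpoint.
        self.observed[client_id] = observed_source

    def introduce(self, a: str, b: str) -> Dict[str, Endpoint]:
        # Tell each client where to send traffic for the other.
        return {a: self.observed[b], b: self.observed[a]}


def direct_path_expected(nat_behaviors: Iterable[str]) -> bool:
    """Per the discussion above, every NAT on the path must be
    endpoint-independent for the exchanged endpoints to be usable."""
    return all(b == "endpoint-independent" for b in nat_behaviors)


s1 = RendezvousServer()
s1.register("A", ("192.0.2.201", 40001))    # port numbers are made up
s1.register("B", ("203.0.113.100", 40002))
peers = s1.introduce("A", "B")
```

After the introduction, A sends to peers["A"] and B sends to peers["B"]; whether those packets get through depends on the NAT behaviors, as direct_path_expected summarizes.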

7.4.2 UNilateral Self-Address Fixing (UNSAF)

Applications employ a number of methods to determine the addresses their
traffic will use when passed through a NAT. This is called fixing (learning and
maintaining) the addressing information. There are indirect and direct methods
for address fixing. The indirect methods involve inferring a NAT's behavior by
exchanging traffic through the NAT. The direct methods involve a direct
conversation between the application and the NAT itself using one or more special
protocols (that are not currently IETF standards). Considerable effort within IETF has
gone into development of the indirect methods, and they are widely supported in
certain applications, with VoIP applications being the most popular. Some of the
direct methods are now supported by some NATs. These methods also provide for
basic configuration of NATs, so we discuss them later in the context of NAT setup
and configuration.

Figure 7-7 Applications running on clients behind a NAT may require help from a server to engage
in direct communication. In hole punching, a server, often specialized for a specific
application, provides rendezvous information among clients that first establish NAT
state and then perform direct communication, if possible. Some applications attempt to
"fix" (determine and maintain) the addresses (and port numbers) their traffic will be
assigned when passing through a NAT using standard generic protocols. These methods
may encounter troubles in certain situations such as environments with multiple levels
of NAT. In this example, client A's external address visible at S1 is 192.0.2.201 and client
B's is 203.0.113.100. At S2, however, B's external address is 10.0.1.1.

An application attempting to fix its address without help from the NAT
performs the address fixing in a so-called unilateral fashion. Applications that do
so are said to perform UNilateral Self-Address Fixing (UNSAF) [RFC3424]. As the
name suggests, such methods are considered to be undesirable in the long run
but a necessary evil for the time being. UNSAF involves a set of heuristics and
is not guaranteed to work in all cases, especially because NAT behaviors vary
significantly based on vendor and particular circumstance. The BEHAVE
documents mentioned earlier are aimed at specifying more consistent NAT behavior. If
widely adopted, UNSAF methods will work more reliably.

In most cases of interest, UNSAF methods operate in a client/server fashion
similar to hole punching, but with added generality. Figure 7-7 illustrates some
of the hazards that can arise in this situation. One issue is the lack of a single
"outside" address realm for every NAT. In this example, there are two levels of
NAT between client B and server S1. This situation can cause complications. For
example, if an application on B wishes to obtain its "outside" address by using
UNSAF with a server, it receives different answers depending on whether it contacts
server S1 or S2. Finally, because UNSAF uses servers that are distinct from
the NATs, there is always the possibility that the actual NAT behavior will
change over time or become inconsistent with what the UNSAF approach reports.
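
The effect of multiple address realms can be mimicked with a toy model. The sketch below is only an illustration under stated assumptions: the Nat class and the observed_source function are hypothetical names, client B's private address (192.168.0.2) is invented for the example, and the two external addresses are the ones given for B in the caption of Figure 7-7.

```python
class Nat:
    """A toy NAT that replaces the source address with its external address."""

    def __init__(self, external_ip):
        self.external_ip = external_ip


def observed_source(client_ip, nats_on_path):
    """Return the source address a server sees after the listed NATs."""
    source = client_ip
    for nat in nats_on_path:
        source = nat.external_ip      # each NAT rewrites the source address
    return source


# Hypothetical private address for client B; the NAT addresses follow Figure 7-7.
inner, outer = Nat("10.0.1.1"), Nat("203.0.113.100")
print(observed_source("192.168.0.2", [inner]))          # what S2 would report
print(observed_source("192.168.0.2", [inner, outer]))   # what S1 would report
```

The two calls return different answers for the same client, which is exactly the ambiguity an UNSAF method faces when it cannot tell which server shares an address realm with the eventual peer.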

Given the various problems with NATs and UNSAF, the IAB, an elected group
of architectural advisers within the IETF, has indicated that UNSAF protocol proposals
must include responses to the following concerns in their specifications:

1. Define a limited-scope problem that the "short-term" UNSAF proposal
addresses.

2. Define an exit strategy/transition plan.

3. Discuss what design decisions make the approach "brittle."

4. Identify requirements for longer-term, sound technical solutions.

5. Discuss any noted practical issues or experiences known.

This is an unusual list of requirements imposed on a protocol specification,
but it results from long-standing interoperability problems between different
NATs and NAT traversal techniques. Despite all the aforementioned problems,
UNSAF methods are commonly used, partly because a wide range of NATs are
found in operation today with little consistent behavior. We now look at how these
methods are used as building blocks to form robust, general-purpose NAT traversal
techniques to maximize the chances that communication among systems
behind NATs, even between systems across multiple NATs such as the one illustrated
in Figure 7-7, will be possible.

7.4.3 Session Traversal Utilities for NAT (STUN) 

One of the primary workhorses for UNSAF and NAT traversal is called Session
Traversal Utilities for NAT (STUN) [RFC5389]. STUN has evolved from a previous
version called Simple Traversal of UDP through NATs, now known as "classic
STUN." Classic STUN has been used with VoIP/SIP applications for some time
but has been revised to be a tool that can be used by other protocols for performing
NAT traversal. Applications requiring a complete solution for NAT traversal
are recommended to begin with other mechanisms we discuss in Section 7.4.5
(e.g., ICE and SIP-Outbound). These frameworks may make use of STUN in one
or more particular ways called STUN usages. Usages may extend the set of basic
STUN operations, message types, or error codes defined in [RFC5389].

STUN is a relatively simple client/server protocol that is able to ascertain
the external IP address and port numbers being used on a NAT in most circumstances.
It can also keep NAT bindings current by using keepalive messages. It
requires a cooperating server on the "other" side of a NAT to be effective, and several
public STUN servers are configured with globally reachable IP addresses and
are available for use on the Internet. The main job of a STUN server is to echo back
STUN requests sent to it in a way that allows the client addressing information to
be fixed. As with UNSAF methods in general, the approach is not foolproof. However,
the attraction of STUN is that it does not require modification of network
routers, application protocols, or servers. It requires only that clients implement
the STUN request protocol, and that at least one STUN server be available in an
appropriate location. STUN was envisioned as a "temporary" measure (as were
many standard protocols now in widespread use a decade or more after their creation)
until a more sophisticated direct protocol was developed and implemented,
or NATs became obsolete because of the adoption of IPv6.

STUN operates using UDP, TCP, or TCP with Transport Layer Security (TLS; see
Chapter 18). STUN usage specifications define which transport protocols are supported
for the particular usage. It uses port 3478 for UDP and TCP, and 5349 for
TCP/TLS. The STUN base protocol has two types of transactions: request/response
transactions and indication transactions. Indications do not require a response and
can be generated by either the client or the server. All messages include a type,
length, magic cookie with value 0x2112A442, and a random 96-bit transaction ID
used for matching requests with responses or for debugging. Each message begins
with two 0 bits and may contain zero or more attributes. STUN message types are
defined in the context of methods that support a particular STUN usage. The various
STUN parameters, including method and attribute numbers, are maintained
by the IANA [ISP]. Attributes have their own types and can vary in length. The
basic STUN header, most often located immediately following a UDP transport
header in an IP packet, is shown in Figure 7-8.
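
As a rough sketch of the layout just described, the following Python fragment packs and unpacks the 20-byte header and shows how the 12 method bits and 2 class bits are interleaved in the Message Type field (see Figure 7-8). The function names are our own and are not part of any STUN library; the method numbers mentioned in the note that follows (1 for binding, 3 for Allocate) are taken from this chapter.

```python
import os
import struct
from typing import Optional, Tuple

MAGIC_COOKIE = 0x2112A442
HEADER_LEN = 20

# Class bits (C1 C0) as listed in Figure 7-8.
REQUEST, INDICATION, SUCCESS, ERROR = 0b00, 0b01, 0b10, 0b11


def encode_type(method: int, cls: int) -> int:
    """Interleave a 12-bit method and the 2 class bits into the 14-bit type."""
    return ((method & 0x000F)             # M3..M0 stay in bits 0-3
            | ((cls & 0b01) << 4)         # C0 goes to bit 4
            | ((method & 0x0070) << 1)    # M6..M4 move to bits 5-7
            | ((cls & 0b10) << 7)         # C1 goes to bit 8
            | ((method & 0x0F80) << 2))   # M11..M7 move to bits 9-13


def decode_type(msg_type: int) -> Tuple[int, int]:
    """Split a message type into (method, class)."""
    method = ((msg_type & 0x000F)
              | ((msg_type & 0x00E0) >> 1)
              | ((msg_type & 0x3E00) >> 2))
    cls = ((msg_type & 0x0010) >> 4) | ((msg_type & 0x0100) >> 7)
    return method, cls


def build_header(method: int, cls: int, body_len: int,
                 transaction_id: Optional[bytes] = None) -> bytes:
    """Build the 20-byte header; body_len excludes the header itself."""
    if body_len % 4:
        raise ValueError("STUN message bodies are padded to 4 bytes")
    tid = os.urandom(12) if transaction_id is None else transaction_id
    if len(tid) != 12:
        raise ValueError("transaction ID must be 96 bits")
    return struct.pack("!HHI12s", encode_type(method, cls), body_len,
                       MAGIC_COOKIE, tid)


def parse_header(data: bytes) -> dict:
    """Parse the header, rejecting data that cannot be a STUN message."""
    msg_type, length, cookie, tid = struct.unpack("!HHI12s", data[:HEADER_LEN])
    if msg_type & 0xC000 or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN message")
    method, cls = decode_type(msg_type)
    return {"method": method, "class": cls, "length": length,
            "transaction_id": tid}
```

With this encoding a binding request has type 0x0001, a binding success response 0x0101, and an Allocate error response 0x0113, which are the values Wireshark reports in Figures 7-9, 7-10, and 7-13.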

The basic STUN header is 20 bytes in length (see Figure 7-8), and the Message
Length field provides for an entire STUN message length of up to 2^16 - 1 bytes (the
20-byte header length is not included in the Message Length field), although messages
are always padded to a multiple of 4 bytes so this field always has its 2
low-order bits set to 0. STUN messages sent over UDP/IP are supposed to form IP
datagrams less than the path MTU, if known, to avoid fragmentation (see Chapter
10). If not known, the entire datagram length (including IP and UDP headers and
any options) should be less than 576 bytes (IPv4) or 1280 bytes (IPv6). STUN has
no provision for cases where a response might exceed the path MTU in the reverse
direction, so servers should arrange to use messages of appropriate size.

STUN messages carried over UDP/IP are not reliable, so STUN applications 
are required to implement their own reliability. This is accomplished by resend¬ 
ing messages thought to be lost. The retransmission interval is based on the esti¬ 
mated time to send and receive a message from the peer called the round-trip time 
(RTT). RTT computation and setting retransmission timers will be a major con¬ 
sideration when we discuss TCP (see Chapter 14). STUN uses a similar approach, 
but with minor modifications to the standard TCP values. See [RFC5389] for more 
details. Reliability issues for STUN over TCP/IP or TCP-with-TLS/IP are handled 
by TCP. Multiple pending STUN transactions can be supported over TCP-based 
connections. 
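
To make the retransmission behavior over UDP concrete, the sketch below computes when a client would resend a request and when it would give up. The numbers it uses (an initial timeout of 500ms that doubles after each transmission, at most 7 transmissions, and a final wait of 16 times the initial timeout) are the defaults suggested in [RFC5389] as best we recall them; the text above does not list them, so treat them as assumptions. The function name is our own.

```python
def retransmission_schedule(rto_ms=500, max_sends=7, final_wait_factor=16):
    """Return (send_times_ms, give_up_ms) for one STUN transaction over UDP.

    The interval between transmissions starts at rto_ms and doubles each
    time. After the last transmission the client waits a further
    final_wait_factor * rto_ms before declaring the transaction failed.
    """
    send_times = []
    now, interval = 0, rto_ms
    for _ in range(max_sends):
        send_times.append(now)
        now += interval
        interval *= 2
    give_up = send_times[-1] + final_wait_factor * rto_ms
    return send_times, give_up


times, give_up = retransmission_schedule()
print(times)      # [0, 500, 1500, 3500, 7500, 15500, 31500]
print(give_up)    # 39500
```

A real client would replace the fixed 500ms with an estimate derived from measured RTTs, in the manner described for TCP in Chapter 14.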

STUN attributes are encoded in a TLV arrangement, a technique used by sev¬ 
eral other Internet protocols. The type and length portions of a TLV are each 16 

[Figure 7-8, Message Type detail: the 14-bit field interleaves 12 method bits (M)
with 2 class bits (C). From most to least significant bit the layout is
M11 M10 M9 M8 M7 C1 M6 M5 M4 C0 M3 M2 M1 M0. Class (C1 C0): 00 = request,
01 = indication, 10 = response (success), 11 = response (error). Example method:
000000001001 = channel bind (see [RFC5766]).]


Figure 7-8 STUN messages always begin with two 0 bits and are usually encapsulated in UDP, although 
TCP is also allowed. The Message Type field gives both the method (e.g., binding) as well as class 
(request, response, error, or success). The Transaction ID is a random 96-bit number used to match
requests with responses, or for debugging in the case of indications. Each STUN message can hold 
zero or more attributes, depending on the particular usage of STUN. 


bits, and the value portion is variable-length (up to 64KB, if supported), but padded
to the next multiple of 4 bytes (padding bits may be any value). The same
attribute type may appear more than once in the same STUN message, although
only the first is necessarily processed by a receiver. Attributes with type numbers
below 0x8000 are called comprehension-required attributes, and the others are called
comprehension-optional attributes. If a STUN agent receives a message containing
comprehension-required attributes it does not know how to process, it generates
an error. Most of the attributes defined to date are comprehension-required [ISP].
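
The TLV encoding can be sketched in a few lines. The functions below are illustrative names of our own. One assumption deserves attention: the sketch stores the unpadded value length in the Length field, which is our reading of [RFC5389]. The SOFTWARE attribute captured in Figure 7-9 reports a length of 12 for a 10-character string, so implementations evidently differ on this point.

```python
import struct


def encode_attr(attr_type: int, value: bytes) -> bytes:
    """Encode one attribute: 16-bit type, 16-bit length, value, padding."""
    pad = -len(value) % 4                     # bytes needed to reach a 4-byte boundary
    return struct.pack("!HH", attr_type, len(value)) + value + b"\x00" * pad


def parse_attrs(body: bytes):
    """Yield (type, value) pairs from the attribute portion of a message."""
    offset = 0
    while offset + 4 <= len(body):
        attr_type, length = struct.unpack_from("!HH", body, offset)
        start = offset + 4
        if start + length > len(body):
            raise ValueError("truncated attribute")
        yield attr_type, body[start:start + length]
        offset = start + length + (-length % 4)   # skip the value and its padding


def is_comprehension_required(attr_type: int) -> bool:
    """Types below 0x8000 must be understood by the receiver."""
    return attr_type < 0x8000


body = encode_attr(0x8022, b"pjnath-1.6") + encode_attr(0x0014, b"viagenie.ca")
for attr_type, value in parse_attrs(body):
    print(hex(attr_type), value, is_comprehension_required(attr_type))
```

A receiver built this way can skip comprehension-optional attributes it does not recognize and reserve its error response for unknown types below 0x8000.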

[RFC5389] defines a single STUN method called binding, which can be used in
either request/response or indication transactions for address fixing and keeping
NAT bindings current. It also defines 11 attributes, given in Table 7-2.

Table 7-2 STUN, defined in [RFC5389] and sometimes called STUN2, replaces classic STUN. These 11 attributes
may be used by a STUN2-compliant client or server.

Name                 Value    Purpose/Use
-------------------  -------  ----------------------------------------------------
MAPPED-ADDRESS       0x0001   Contains an address family indicator and the
                              reflexive transport address (IPv4 or IPv6)
USERNAME             0x0006   User name and password; used for message
                              integrity checks (up to 513 bytes)
MESSAGE-INTEGRITY    0x0008   Message authentication code value on the STUN
                              message (see Chapter 18 and [RFC5389])
ERROR-CODE           0x0009   Contains 3-bit error class, 8-bit error code value,
                              and variable-length textual description of error
UNKNOWN-ATTRIBUTES   0x000A   Used with error messages to indicate the unknown
                              attributes (one 16-bit value per attribute)
REALM                0x0014   Indicates the authentication "realm" name for
                              long-term credentials
NONCE                0x0015   Nonrepeated value optionally carried in requests
                              and responses (see Chapter 18) to prevent replay
                              attacks
XOR-MAPPED-ADDRESS   0x0020   XORed version of MAPPED-ADDRESS
SOFTWARE             0x8022   Textual description of the software that sent the
                              message (e.g., manufacturer and version number)
ALTERNATE-SERVER     0x8023   Provides an alternate IP address for a client to use;
                              encoded as with MAPPED-ADDRESS
FINGERPRINT          0x8028   CRC-32 of message XORed with 0x5354554E; must
                              be last attribute if used (optional)


Referring to Figure 7-5, a STUN client with addressing information X:x is
often interested in determining X1':x1', called the reflexive transport address or
mapped address. A STUN server at Y1:y1 includes the reflexive transport address
in a MAPPED-ADDRESS attribute in a STUN message returned to the client. The
MAPPED-ADDRESS attribute holds an 8-bit Address Family field, a 16-bit Port
Number field, and either a 32-bit or 128-bit Address field, depending on whether
IPv4 or IPv6 is indicated by the Address Family field (0x01 for IPv4; 0x02 for IPv6).
This attribute is included to remain backward-compatible with classic STUN. The
more important attribute is the XOR-MAPPED-ADDRESS attribute, which holds
exactly the same value as the MAPPED-ADDRESS attribute, but XORed with the
magic cookie value (for IPv4) or a concatenation of the magic cookie and transaction
ID values (for IPv6). The reason for using XORed values in this way is to
detect and bypass generic ALGs that look through packets and rewrite whatever
IP addresses they find. Such ALGs are very brittle because they may rewrite information
that protocols such as STUN require. Experience has shown that XORing
IP addresses in the packet payload is usually sufficient to bypass such ALGs.
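
The XOR encoding is easy to undo. The sketch below decodes the value portion of an XOR-MAPPED-ADDRESS attribute; the function name is our own. The text above describes only the address, so the handling of the port (XORed with the most significant 16 bits of the magic cookie) is an assumption based on our reading of [RFC5389]. It is consistent with the XORed values that appear in the capture of Figure 7-10, which are used as the example.

```python
import ipaddress
import struct

MAGIC_COOKIE = 0x2112A442


def decode_xor_mapped_address(value: bytes, transaction_id: bytes = b""):
    """Decode an XOR-MAPPED-ADDRESS attribute value into (ip, port)."""
    _reserved, family, xport = struct.unpack_from("!BBH", value)
    port = xport ^ (MAGIC_COOKIE >> 16)          # upper 16 bits of the cookie
    cookie = struct.pack("!I", MAGIC_COOKIE)
    if family == 0x01:                           # IPv4: XOR with the cookie
        key, raw = cookie, value[4:8]
    elif family == 0x02:                         # IPv6: cookie || transaction ID
        if len(transaction_id) != 12:
            raise ValueError("IPv6 decoding needs the 96-bit transaction ID")
        key, raw = cookie + transaction_id, value[4:20]
    else:
        raise ValueError("unknown address family")
    if len(raw) != len(key):
        raise ValueError("truncated address")
    address = bytes(a ^ b for a, b in zip(raw, key))
    return str(ipaddress.ip_address(address)), port


# Reserved byte, family 0x01, XORed port a31c, XORed address 66941294.
value = bytes.fromhex("0001a31c66941294")
print(decode_xor_mapped_address(value))          # ('71.134.182.214', 33294)
```

Because XOR is its own inverse, a server can produce the attribute with the same arithmetic applied to the plain address and port.
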

A STUN client, including most VoIP devices and "softphone" applications such
as pjsua [PJSUA], is initially configured with the IP address(es) or names of one or
more STUN servers. It is desirable to use STUN servers that are likely to "see" the
same IP addresses as the peer to which the application ultimately wishes to talk,
although that may be difficult to determine. Using STUN servers located on the
public Internet (e.g., stun.ekiga.net, stun.xten.com, numb.viagenie.ca)
is usually adequate. Some servers may be discovered using DNS Service (SRV)
records (see Chapter 11). An example STUN binding request is given in Figure 7-9.


[Wireshark capture: stun-binding.tr]

No.  Time      Protocol  Source         Destination    Info
1    0.000000  STUN      10.59.1.37     216.146.46.55  Binding Request
2    0.069934  STUN      216.146.46.55  10.59.1.37     Binding Success Response

Frame 1: 78 bytes on wire (624 bits), 78 bytes captured (624 bits)
Ethernet II, Src: 00:17:f2:e7:6d:91 (00:17:f2:e7:6d:91), Dst: 00:80:48:51:f5
Internet Protocol, Src: 10.59.1.37 (10.59.1.37), Dst: 216.146.46.55 (216.146
User Datagram Protocol, Src Port: 61201 (61201), Dst Port: 3478 (3478)
Session Traversal Utilities for NAT
    [Response In: 2]
    Message Type: 0x0001 (Binding Request)
    Message Length: 16
    Message Cookie: 2112a442
    Message Transaction ID: c00162f613d45a3ac1fd0100
    Attributes
        SOFTWARE
            Attribute Type: SOFTWARE (0x8022)
                Attribute Type Comprehension: Optional (1)
                Attribute Type Assignment: IETF Review (0)
            Attribute Length: 12
            Software: pjnath-1.6


Figure 7-9 A STUN binding request. The request contains a 96-bit transaction ID and the SOFTWARE
attribute that identifies the client making the request. The attribute contains 10
characters, but this value is rounded up to the next multiple of 4, giving an attribute
length of 12. The message length of 16 also includes the 4 bytes used to include the
attribute's type and length (the STUN header is not included).


The sample STUN binding request in Figure 7-9 is initiated from a client.
The transaction ID has been selected randomly, and the request is sent to
numb.viagenie.ca (with IPv4 addresses 216.146.46.55 and 216.146.46.59), which is
both a STUN and a TURN server (see Section 7.4.4). The request contains the
SOFTWARE attribute that identifies the client application. In this case, the request
was initiated by pjnath-1.6. This is the "PJSIP NAT helper" application included
with pjsua. The message length includes 4 bytes for the attribute type and length,
plus 12 bytes used to hold the attribute. The length of pjnath-1.6 is only 10 bytes,
but attribute lengths are always rounded up to the nearest 4-byte multiple. After
passing through a NAT, the response is given as shown in Figure 7-10.
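
The byte counts in this request can be reproduced with a short sketch. The function name is our own, and the sketch pads the attribute value with zero bytes, which is an assumption; the capture does not show which padding bytes pjnath used. Either way the Message Length comes to 16 and the whole STUN message to 36 bytes, which together with the 8-byte UDP, 20-byte IPv4, and 14-byte Ethernet headers accounts for the 78-byte frame.

```python
import os
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
SOFTWARE = 0x8022


def binding_request(software: bytes, transaction_id: bytes = b"") -> bytes:
    """Build a binding request carrying a single SOFTWARE attribute."""
    tid = transaction_id or os.urandom(12)        # random 96-bit transaction ID
    attr = struct.pack("!HH", SOFTWARE, len(software)) + software
    attr += b"\x00" * (-len(attr) % 4)            # pad to a 4-byte boundary
    header = struct.pack("!HHI12s", BINDING_REQUEST, len(attr),
                         MAGIC_COOKIE, tid)
    return header + attr


msg = binding_request(b"pjnath-1.6")
print(len(msg), struct.unpack("!H", msg[2:4])[0])   # 36 16
```

Sending these bytes in a UDP datagram to port 3478 of a STUN server, and parsing the attributes of the reply, is all a minimal client needs to do to learn its server-reflexive address.
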

[Wireshark capture: stun-binding.tr]

No.  Time      Protocol  Source         Destination    Info
1    0.000000  STUN      10.59.1.37     216.146.46.55  Binding Request
2    0.069934  STUN      216.146.46.55  10.59.1.37     Binding Success Response mapped-address: 71.134.182.214:33294

Frame 2: 110 bytes on wire (880 bits), 110 bytes captured (880 bits)
Ethernet II, Src: 00:80:48:51:f5:b8 (00:80:48:51:f5:b8), Dst: 00:17:f2:e7:6d:91 (00:17:f2:e7:6d:91)
Internet Protocol, Src: 216.146.46.55 (216.146.46.55), Dst: 10.59.1.37 (10.59.1.37)
User Datagram Protocol, Src Port: 3478 (3478), Dst Port: 61201 (61201)
Session Traversal Utilities for NAT
    [Request In: 1]
    [Time: 0.069934000 seconds]
    Message Type: 0x0101 (Binding Success Response)
    Message Length: 48
    Message Cookie: 2112a442
    Message Transaction ID: c00162f613d45a3ac1fd0100
    Attributes
        MAPPED-ADDRESS: 71.134.182.214:33294
            Attribute Type: MAPPED-ADDRESS (0x0001)
                Attribute Type Comprehension: Required (0)
                Attribute Type Assignment: IETF Review (0)
            Attribute Length: 8
            Reserved: 1
            Protocol Family: IPv4 (0x01)
            Port: 33294
            IP: 71.134.182.214 (71.134.182.214)
        XOR-MAPPED-ADDRESS: 71.134.182.214:33294
            Attribute Type: XOR-MAPPED-ADDRESS (0x0020)
                Attribute Type Comprehension: Required (0)
                Attribute Type Assignment: IETF Review (0)
            Attribute Length: 8
            Reserved: 1
            Protocol Family: IPv4 (0x01)
            Port (XOR-d): a31c
            [Port: 33294]
            IP (XOR-d): 66941294
            [IP: 71.134.182.214 (71.134.182.214)]
        RESPONSE-ORIGIN: 216.146.46.55:3478
            Attribute Type: RESPONSE-ORIGIN (0x802b)
                Attribute Type Comprehension: Optional (1)
                Attribute Type Assignment: IETF Review (0)
            Attribute Length: 8
            Reserved: 1
            Protocol Family: IPv4 (0x01)
            Port: 3478
            IP: 216.146.46.55 (216.146.46.55)
        OTHER-ADDRESS: 216.146.46.59:3479
            Attribute Type: OTHER-ADDRESS (0x802c)
                Attribute Type Comprehension: Optional (1)
                Attribute Type Assignment: IETF Review (0)
            Attribute Length: 8
            Reserved: 1
            Protocol Family: IPv4 (0x01)
            Port: 3479
            IP: 216.146.46.59 (216.146.46.59)


Figure 7-10 A STUN binding response containing four attributes. The MAPPED-ADDRESS and XOR- 
MAPPED-ADDRESS attributes contain the server-reflexive addressing information. The other 
attributes are used with an experimental NAT behavior discovery mechanism [RFC5780]. 

The binding response shown in Figure 7-10 gives useful information to the client,
encoded as a collection of attributes. The MAPPED-ADDRESS and XOR-MAPPED-ADDRESS
attributes indicate that the STUN server determined the server-reflexive
address of 71.134.182.214:33294. The RESPONSE-ORIGIN and OTHER-ADDRESS
attributes are used by an experimental facility for discovering NAT behavior
[RFC5780]. The first gives the communication endpoint used to send the STUN
message (216.146.46.55:3478, which matches the sending IPv4 address and UDP
port number). The second attribute indicates which source IPv4 address and port
number (216.146.46.59:3479) would have been used if the client requested "change
address" or "change port" behavior. This latter attribute is equivalent to the
now-deprecated CHANGED-ADDRESS attribute in classic STUN. If a change address or
port is specified in a request, a cooperating STUN server attempts to use a different
address when responding to the client, if possible.

STUN can be used to perform address fixing as well as a number of other 
functions called mechanisms, including DNS discovery, a method to redirect to an 
alternate server, and message integrity exchanges. Mechanisms are selected in 
the context of a particular STUN usage, so in general they are considered optional 
STUN features. One of the more important mechanisms provides authentication 
and message integrity. It has two forms: the short-term credential mechanism and the 
long-term credential mechanism. 

Short-term credentials are intended to last for a single session; the particular 
duration is defined by the STUN usage. Long-term credentials last across sessions; 
they correspond to a login ID or account. Short-term credentials are often used 
in particular message exchanges, and long-term credentials are used when some 
particular resource is to be allocated (e.g., with TURN; see Section 7.4.4). Pass¬ 
words are never sent in the clear where they could be intercepted. 

The short-term credential mechanism uses the USERNAME and MESSAGE-INTEGRITY
attributes. Both are required on any request. The USERNAME gives
an indication of which credentials are required and allows the message sender to
use the appropriate shared password in forming an integrity check on the message
(a MAC computed on the message contents; see Chapter 18). When using
short-term credentials, it is assumed that some form of credential information
(e.g., user name and password) has been exchanged earlier. The credential is used
for forming an integrity check on STUN messages that is encoded in the MESSAGE-INTEGRITY
attribute. The ability to form a valid MESSAGE-INTEGRITY
attribute value is an indication that the sender holds a current ("fresh") copy of the
appropriate credential.
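
A minimal sketch of how such an integrity check can be formed and verified follows. It assumes the MAC is HMAC-SHA1 (the algorithm Wireshark reports for MESSAGE-INTEGRITY in Figure 7-13) keyed directly with the shared password, and that the header's Length field is adjusted to cover the new attribute before the MAC is computed. Both details reflect our reading of [RFC5389] rather than anything stated above, and the password preparation step that specification calls for is omitted. The function names are our own.

```python
import hashlib
import hmac
import struct

MESSAGE_INTEGRITY = 0x0008
HEADER_LEN = 20
MI_ATTR_LEN = 24          # 4-byte attribute header + 20-byte HMAC-SHA1 value


def add_message_integrity(message: bytes, key: bytes) -> bytes:
    """Append a MESSAGE-INTEGRITY attribute to a complete STUN message."""
    new_len = len(message) - HEADER_LEN + MI_ATTR_LEN
    # Rewrite the Length field so that it already counts the new attribute.
    adjusted = message[:2] + struct.pack("!H", new_len) + message[4:]
    mac = hmac.new(key, adjusted, hashlib.sha1).digest()
    return adjusted + struct.pack("!HH", MESSAGE_INTEGRITY, len(mac)) + mac


def check_message_integrity(message: bytes, key: bytes) -> bool:
    """Verify a message whose final attribute is MESSAGE-INTEGRITY."""
    covered, attr = message[:-MI_ATTR_LEN], message[-MI_ATTR_LEN:]
    attr_type, length = struct.unpack("!HH", attr[:4])
    if attr_type != MESSAGE_INTEGRITY or length != 20:
        return False
    expected = hmac.new(key, covered, hashlib.sha1).digest()
    return hmac.compare_digest(expected, attr[4:])


# A bare binding request header with a hypothetical shared password.
request = struct.pack("!HHI12s", 0x0001, 0, 0x2112A442, bytes(12))
signed = add_message_integrity(request, b"secret")
print(check_message_integrity(signed, b"secret"))    # True
print(check_message_integrity(signed, b"guess"))     # False
```

Only a party holding the password can produce a value that verifies, which is the sense in which a valid attribute demonstrates possession of the credential.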

The long-term credential mechanism ensures freshness in a different way,
using a digest challenge. When using this mechanism, a client initially makes a
request without any authentication material. The server rejects the request but provides
a REALM attribute in response. This can be used by the client to determine
which credential is needed to provide adequate authentication, as the client may
have credentials for various services (e.g., multiple VoIP accounts). Along with the
REALM, the server provides a never-reused NONCE value, which the client uses in
forming a subsequent request. This mechanism also uses a MESSAGE-INTEGRITY
attribute, but its integrity function is computed by including the NONCE value.
Thus, it is difficult for an eavesdropper that overheard a previous long-term credential
exchange to simply replay a validated request (i.e., because the NONCE value
is different). The use of NONCE values in authentication and related concerns are
discussed in more detail in Chapter 18. The long-term credential mechanism cannot
be used to protect STUN indications, as these transactions do not operate as
request/response pairs.
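
The challenge exchange can be imitated with a toy model. Everything in the sketch below is illustrative: ToyServer and request_mac are hypothetical names, the user "alice" and her password are invented, and the MAC is computed over a simplified byte string instead of a real STUN message. The key derivation (an MD5 hash of the user name, realm, and password joined by colons) is how we understand [RFC5389] to define the long-term key; the text above does not spell it out.

```python
import hashlib
import hmac
import secrets


def long_term_key(username: str, realm: str, password: str) -> bytes:
    """Derive the 16-byte key used for MESSAGE-INTEGRITY."""
    return hashlib.md5(f"{username}:{realm}:{password}".encode("utf-8")).digest()


def request_mac(key: bytes, nonce: str, body: bytes) -> bytes:
    """MAC over the nonce plus request body (a stand-in for a real message)."""
    return hmac.new(key, nonce.encode("ascii") + body, hashlib.sha1).digest()


class ToyServer:
    def __init__(self, realm, users):
        self.realm, self.users = realm, users
        self.nonce = None

    def challenge(self):
        """Reject an unauthenticated request, returning a REALM and a fresh NONCE."""
        self.nonce = secrets.token_hex(16)
        return self.realm, self.nonce

    def accept(self, username, nonce, body, mac):
        """Accept a request only if it carries the outstanding nonce and a valid MAC."""
        if self.nonce is None or nonce != self.nonce or username not in self.users:
            return False
        key = long_term_key(username, self.realm, self.users[username])
        ok = hmac.compare_digest(mac, request_mac(key, nonce, body))
        self.nonce = None                 # this toy allows each nonce to be used once
        return ok


server = ToyServer("viagenie.ca", {"alice": "pw"})
realm, nonce = server.challenge()
mac = request_mac(long_term_key("alice", realm, "pw"), nonce, b"allocate")
print(server.accept("alice", nonce, b"allocate", mac))   # True
print(server.accept("alice", nonce, b"allocate", mac))   # False: replay
```

The second call fails because the nonce is no longer outstanding, which is the property that frustrates an eavesdropper who replays an earlier request.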

7.4.4 Traversal Using Relays around NAT (TURN) 

Traversal Using Relays around NAT (TURN) [RFC5766] provides a way for two or
more systems to communicate even if they are located behind relatively uncoop¬ 
erative NATs. As a last-resort method to support communication in such circum¬ 
stances, it involves a relay server that shuttles data between systems that could 
otherwise not communicate. Using extensions to STUN and some TURN-specific 
messages, it supports communication even when most other approaches have 
failed, provided a common server that is not behind a NAT can be reached by each 
client. If all NATs were compliant with the BEHAVE specifications, TURN would 
not be necessary. Direct communication methods (i.e., that do not use TURN) are 
almost always preferable to using TURN servers. 

Referring to Figure 7-11, a TURN client behind a NAT contacts a TURN server,
usually on the public Internet, and indicates the other systems (called peers) with
which it wishes to communicate. Finding the server's address and the appropriate
protocol to use for communication is accomplished using a special DNS NAPTR
record (see Chapter 11 and [RFC5928]) or by manual configuration. The client
obtains address and port information, called the relayed transport address, from the
server, which are the address and port number used by the TURN server to communicate
with the peers. The client also obtains its own server-reflexive transport
address. Peers also have server-reflexive transport addresses that represent their
external addresses. These addresses are needed by the client and server to perform
the "plumbing" necessary to interconnect the client and its peers. The method
used to exchange this addressing information is not defined within the scope of
TURN. Instead, this information must be exchanged using some other mechanism
(e.g., ICE; see Section 7.4.5) in order for TURN servers to be used effectively.

The client uses TURN commands to create and maintain allocations on the
server. An allocation resembles a multiway NAT binding and includes the (unique)
relayed transport address that each peer can use to reach the client. Server/peer
data is sent using straightforward TURN messages traditionally carried in UDP/
IPv4. Enhancements support TCP [RFC6062] and IPv6 (and also relaying between
IPv4 and IPv6) [RFC6156]. Server/client data is encapsulated with an indication
of corresponding peer(s) that sent or should receive the associated data. The
client/server connection has been specified for UDP/IPv4, TCP/IPv4, and TCP/IPv4



Figure 7-11 Based on [RFC5766], a TURN server helps clients behind "bad" NATs to communicate by relay¬ 
ing traffic. Traffic flowing between client and server may use TCP, UDP, or TCP with TLS. Traffic 
between the server and one or more peers uses UDP. Relaying is a last-resort measure for com¬ 
munication; direct methods are preferred if available. 


with TLS. Establishing an allocation requires the client to be authenticated, usu¬ 
ally using the STUN long-term credential mechanism. 

TURN supports two methods for copying data between a client and its peers.
The first encodes data using STUN methods called Send and Data, defined in
[RFC5766], which are STUN indications and therefore not authenticated. The other
uses a TURN-specific concept called channels. Channels are communication paths
between a client and a peer that have less overhead than the Send and Data methods.
Messages carried over channels use a smaller, 4-byte header that is incompatible
with the larger STUN-formatted messages ordinarily used by TURN. Up to
16K channels can be associated with an allocation. Channels were developed to
help some applications such as VoIP that prefer to use relatively small packets to
reduce latency and overhead.
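
The 4-byte channel header is simple enough to sketch directly. The layout assumed here (a 16-bit channel number followed by a 16-bit data length) and the channel number range 0x4000 through 0x7FFF are taken from our reading of [RFC5766]; the range gives exactly the 16K channels mentioned above. The function names are our own.

```python
import struct

CHANNEL_MIN, CHANNEL_MAX = 0x4000, 0x7FFF     # 16K usable channel numbers


def encode_channel_data(channel: int, data: bytes) -> bytes:
    """Prefix application data with the 4-byte channel header."""
    if not CHANNEL_MIN <= channel <= CHANNEL_MAX:
        raise ValueError("channel number out of range")
    return struct.pack("!HH", channel, len(data)) + data


def decode_channel_data(packet: bytes):
    """Return (channel, data) from a channel message."""
    channel, length = struct.unpack_from("!HH", packet)
    if not CHANNEL_MIN <= channel <= CHANNEL_MAX:
        raise ValueError("not a channel message")
    return channel, packet[4:4 + length]


packet = encode_channel_data(0x4000, b"voice")
print(len(packet), decode_channel_data(packet))   # 9 (16384, b'voice')
```

Because every channel number in this range begins with the bits 01, a receiver can tell a channel message from a STUN-formatted message, which always begins with two 0 bits. A Send or Data indication carrying the same payload would need the 20-byte STUN header plus separate attributes for the peer address and the data.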

In operation, the client makes a request to obtain an allocation using a TURN-defined
STUN Allocate method. If successful, the server responds with a success
indicator and the allocated relayed transport address. A request might be denied
if the client fails to provide adequate authentication to the server. The client must
now send refresh messages to keep the allocation alive. Allocations expire in 10
minutes if not refreshed, unless the client included an alternate lifetime value,
encoded as a STUN LIFETIME attribute, in the allocation request. Allocations
may be deleted by refreshing the allocation with a lifetime of zero. When an allocation
expires, so do all of its associated channels.
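
The lifetime rules can be captured in a small model of the server's bookkeeping. This is a sketch only: the Allocation class, its method names, and the injected clock are our own inventions for illustration, and the peer address in the example is hypothetical.

```python
import time

DEFAULT_LIFETIME = 600          # 10 minutes, in seconds


class Allocation:
    def __init__(self, relayed_addr, lifetime=DEFAULT_LIFETIME, now=time.monotonic):
        self._now = now                        # injectable clock, for testing
        self.relayed_addr = relayed_addr
        self.channels = {}                     # channel number -> peer address
        self.expires_at = now() + lifetime

    def refresh(self, lifetime=DEFAULT_LIFETIME):
        """Extend the allocation; a lifetime of 0 deletes it immediately."""
        self.expires_at = self._now() + lifetime
        if lifetime == 0:
            self.channels.clear()

    @property
    def alive(self):
        return self._now() < self.expires_at

    def peer_for(self, channel):
        """Channels are usable only while their allocation is alive."""
        return self.channels.get(channel) if self.alive else None


clock = [0.0]
alloc = Allocation(("216.146.46.55", 49261), now=lambda: clock[0])
alloc.channels[0x4000] = ("192.0.2.1", 5000)
clock[0] = 599.0
print(alloc.alive)              # True: still within the first 10 minutes
alloc.refresh(0)
print(alloc.alive)              # False: a zero-lifetime refresh deletes it
```

A client would typically schedule its refresh somewhat before the lifetime runs out so that a lost refresh message can be retransmitted in time.
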

Allocations are represented using a "5-tuple." At the client, the 5-tuple includes
the client's host transport address and port number, server transport address and
port number, and the transport protocol used to communicate with the server. At
the server, the same 5-tuple is used, except the client's host transport address and
port are replaced with its server-reflexive address and port. An allocation may
have zero or more associated permissions, to limit the patterns of connectivity that
are permitted through the TURN server. Each permission includes an IP address
restriction such that only packets with the matching source address received at
the TURN server have their data payloads forwarded to the corresponding client.
Permissions are deleted if not refreshed within 5 minutes.
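
A permission check of this kind amounts to a table of peer IP addresses with expiry times. The sketch below is a hypothetical model (the Permissions class and its method names are ours), using the 5-minute lifetime given above and matching on the source IP address only, as the text describes.

```python
import time

PERMISSION_LIFETIME = 300       # 5 minutes, in seconds


class Permissions:
    def __init__(self, now=time.monotonic):
        self._now = now                        # injectable clock, for testing
        self._expiry = {}                      # peer IP address -> expiry time

    def create_or_refresh(self, peer_ip: str):
        """Install a permission or restart its 5-minute timer."""
        self._expiry[peer_ip] = self._now() + PERMISSION_LIFETIME

    def allows(self, source_ip: str) -> bool:
        """True if data arriving from source_ip may be relayed to the client."""
        expiry = self._expiry.get(source_ip)
        if expiry is None or self._now() >= expiry:
            self._expiry.pop(source_ip, None)  # drop stale entries
            return False
        return True


clock = [0.0]
perms = Permissions(now=lambda: clock[0])
perms.create_or_refresh("192.0.2.201")
print(perms.allows("192.0.2.201"))     # True
print(perms.allows("198.51.100.7"))    # False: no permission installed
clock[0] = 300.0
print(perms.allows("192.0.2.201"))     # False: not refreshed in time
```

Note that the port number plays no part in the check, so any port on a permitted peer address can reach the client through the relay.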

TURN enhances STUN with six methods, nine attributes, and six error response
codes. These can be partitioned roughly into support for establishing and maintaining
allocations, authentication, and manipulating channels. The six methods and
their method numbers are as follows: Allocate (3), Refresh (4), Send (6), Data (7),
CreatePermission (8), and ChannelBind (9). The first two establish and keep allocations
alive. Send and Data use STUN messages to encapsulate data from client
to server and vice versa, respectively. CreatePermission establishes or refreshes a
permission, and ChannelBind associates a particular peer with a 16-bit channel
number. The error messages indicate problems with TURN features such as authentication
failure or running out of resources (e.g., channel numbers). The nine STUN
attribute names, values, and purposes defined by TURN are given in Table 7-3.


Table 7-3 STUN attributes defined by TURN

Name                 Value    Purpose/Use
-------------------  -------  ------------------------------------------------------
CHANNEL-NUMBER       0x000C   Indicates what channel associated data belongs to
LIFETIME             0x000D   Requested allocation timeout (seconds)
XOR-PEER-ADDRESS     0x0012   A peer's address and port, using XORed encoding
DATA                 0x0013   Holds data for a Send or Data indication
XOR-RELAYED-ADDRESS  0x0016   Server's address and port allocated for a client
EVEN-PORT            0x0018   Requests that the relayed transport addressing
                              information use an even port; optionally requests
                              allocation of the next port in sequence
REQUESTED-TRANSPORT  0x0019   Used in a client to request that a specific transport
                              be used in forming the transport address; values are
                              drawn from the IPv4 Protocol or IPv6 Next Header
                              field values
DONT-FRAGMENT        0x001A   Requests that the server set the "don't fragment" bit in
                              the IPv4 header in packets sent to peers
RESERVATION-TOKEN    0x0022   Unique identifier for a relayed transport address held
                              by the server; the value is provided to the client as a
                              reference

[Wireshark capture: stun-turn.tr]

No.  Time      Protocol  Source         Destination    Info
1    0.000000  STUN      10.59.1.37     216.146.46.55  Allocate Request UDP
2    0.071807  STUN      216.146.46.55  10.59.1.37     Allocate Error Response error-code: 401

Frame 1: 86 bytes on wire (688 bits), 86 bytes captured (688 bits)
Ethernet II, Src: 00:17:f2:e7:6d:91 (00:17:f2:e7:6d:91), Dst: 00:80:48:51:f5:b8 (00:80:48:51:f
Internet Protocol, Src: 10.59.1.37 (10.59.1.37), Dst: 216.146.46.55 (216.146.46.55)
User Datagram Protocol, Src Port: 65482 (65482), Dst Port: 3479 (3479)
Session Traversal Utilities for NAT
    [Response In: 2]
    Message Type: 0x0003 (Allocate Request)
    Message Length: 24
    Message Cookie: 2112a442
    Message Transaction ID: 85cd0000f5e769158fbb307a
    Attributes
        REQUESTED-TRANSPORT: UDP
            Attribute Type: REQUESTED-TRANSPORT (0x0019)
            Attribute Length: 4
            Transport: UDP (0x11)
            Reserved: 3
        SOFTWARE
            Attribute Type: SOFTWARE (0x8022)
            Attribute Length: 12
            Software: pjnath-1.6


Figure 7-12 A TURN allocation request is a STUN message using message type 0x0003. This request 
also includes the REQUESTED-TRANSPORT and SOFTWARE attributes. It does not 
include authentication information. According to STUN long-term credentials, this 
request will fail. 


A TURN request takes the form of a STUN message whose message type is 
an allocation request. Figure 7-12 shows an example. According to the STUN long¬ 
term credential mechanism, the initial allocation request shown in Figure 7-12 did 
not include authentication information, so it is rejected by the server. The rejection 
is indicated by an allocation error response, shown in Figure 7-13. 

The error message in Figure 7-13 provides the REALM attribute (viagenie.ca)
and the NONCE value the client requires to form its next request. The message
also includes the MESSAGE-INTEGRITY attribute so the client can check that
the message has not been modified and the requested REALM and NONCE are
correct. A subsequent request includes the USERNAME, NONCE, and MESSAGE-INTEGRITY
attributes. See Figure 7-14.

After receiving the request including long-term credentials, as shown in Figure
7-14, the server computes its own version of the message integrity value and
compares the result against the MESSAGE-INTEGRITY attribute value. If they
match, this is sufficient information for the TURN server to conclude that the client
must hold the appropriate password. It then permits the allocation and indicates
the result to the client (see Figure 7-15).

[Wireshark capture: stun-turn.tr]

No.  Time      Protocol  Source         Destination    Info
1    0.000000  STUN      10.59.1.37     216.146.46.55  Allocate Request UDP
2    0.071807  STUN      216.146.46.55  10.59.1.37     Allocate Error Response error-code: 401

Frame 2: 166 bytes on wire (1328 bits), 166 bytes captured (1328 bits)
Ethernet II, Src: 00:80:48:51:f5:b8 (00:80:48:51:f5:b8), Dst: 00:17:f2:e7:6d:91 (00:17:f2:e7:6
Internet Protocol, Src: 216.146.46.55 (216.146.46.55), Dst: 10.59.1.37 (10.59.1.37)
User Datagram Protocol, Src Port: 3479 (3479), Dst Port: 65482 (65482)
Session Traversal Utilities for NAT
    [Request In: 1]
    [Time: 0.071807000 seconds]
    Message Type: 0x0113 (Allocate Error Response)
    Message Length: 104
    Message Cookie: 2112a442
    Message Transaction ID: 85cd0000f5e769158fbb307a
    Attributes
        ERROR-CODE 401 (Unauthorized): Unauthorized
            Attribute Type: ERROR-CODE (0x0009)
            Attribute Length: 16
            Reserved: 2
            Error Class: 4
            Error Code: 1
            Error Reason Phrase: Unauthorized
        REALM: viagenie.ca
            Attribute Type: REALM (0x0014)
            Attribute Length: 11
            Realm: viagenie.ca
            Padding: 1
        NONCE
            Attribute Type: NONCE (0x0015)
            Attribute Length: 40
            Nonce: TwQMTAAAAAA=RXlPkBMrwMKgYyteh6zuxlwRlu8=
        MESSAGE-INTEGRITY
            Attribute Type: MESSAGE-INTEGRITY (0x0008)
            Attribute Length: 20
            HMAC-SHA1: 0e06d5bbaac5154119ad5a000e36399abe7685ea


Figure 7-13 A TURN allocation error response includes the ERROR-CODE attribute with value 
401 (Unauthorized). The message is integrity-protected and includes the REALM and 
NONCE attributes required by the client in forming another, authenticated allocation 
request. 


The allocation request is successful, as shown in Figure 7-15, and the relayed
transport address is 216.146.46.55:49261 (note that Wireshark performs the XOR
operation to display the decoded address). At this point, the client can proceed
to use the TURN server for relaying to peers. Once this is finished, the
allocation can be removed. About 4s later, packets 5 and 6 in Figure 7-15
indicate the client's request to remove the allocation. The request is expressed
as a refresh with lifetime set to 0. The server responds with a success indicator
and removes the allocation. Note that the BANDWIDTH attribute has been included
in the allocation and refresh success indicators. This attribute, defined by a
draft version of [RFC5766] but ultimately deprecated, was intended to hold the
peak bandwidth, in kilobytes per second, permitted on the allocation. This
attribute may be redefined in the future.
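
The XOR decoding that Wireshark performs can be reproduced by hand. The sketch
below assumes the encoding used for IPv4 XOR-type address attributes in STUN:
the port is XORed with the upper 16 bits of the magic cookie (0x2112a442, the
"Message cookie" value in the traces) and the address with the whole cookie. The
function name is hypothetical; the two input values are the "XOR-d" fields
decoded in Figure 7-15.

```python
import ipaddress
from typing import Tuple

MAGIC_COOKIE = 0x2112A442  # fixed STUN magic cookie


def decode_xor_address(xor_port: int, xor_addr: int) -> Tuple[str, int]:
    """Undo the XOR applied to an IPv4 XOR-RELAYED/XOR-MAPPED address."""
    port = xor_port ^ (MAGIC_COOKIE >> 16)   # upper 16 bits of the cookie
    addr = xor_addr ^ MAGIC_COOKIE           # all 32 bits of the cookie
    return str(ipaddress.IPv4Address(addr)), port


if __name__ == "__main__":
    # Port (XOR-d) e17f and IP (XOR-d) f9808a75 from Figure 7-15.
    print(decode_xor_address(0xE17F, 0xF9808A75))  # ('216.146.46.55', 49261)
```

The result matches the relayed transport address reported by the server.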

[Wireshark capture stun-turn.tr. Packet 3 (selected, 194 bytes): Allocate Request
(message type 0x0003) from 10.59.1.37 port 65482 to 216.146.46.55 port 3479,
carrying the REQUESTED-TRANSPORT (UDP), SOFTWARE (pjnath-1.6), USERNAME
(kfall.sip@gmail.com), REALM (viagenie.ca), NONCE, and MESSAGE-INTEGRITY
(HMAC-SHA1) attributes. Packets 4 through 6 are the Allocate Success Response,
a Refresh Request with lifetime 0, and a Refresh Success Response with
lifetime 0.]


Figure 7-14 A second TURN allocation request includes the USERNAME, REALM, NONCE, and 
MESSAGE-INTEGRITY attributes. These are used by the server to verify integrity of the 
message and the identity of the client. If successful, the server authenticates the request 
and performs the allocation. 


As suggested previously, TURN has the disadvantage that traffic must be
relayed through the TURN server, and this can lead to inefficient routing (i.e.,
the TURN server may be far away from a client and peer that are proximal). In
addition, certain other traffic contents are not passed through from peer to
client using TURN. This includes ICMP values (see Chapter 8), TTL (Hop Limit)
field values, and IP DS Field values. Also, a requesting TURN client must
implement the STUN long-term credential mechanism and have some form of login
credential or account assigned by the TURN server operator. This helps to avoid
uncontrolled use of open TURN servers but creates somewhat greater configuration
complexity.

[Wireshark capture stun-turn.tr. Packet 4 (selected, 126 bytes): Allocate Success
Response (message type 0x0103) from 216.146.46.55 port 3479 to 10.59.1.37 port
65482, carrying the XOR-RELAYED-ADDRESS (216.146.46.55:49261; Port (XOR-d) e17f,
IP (XOR-d) f9808a75), LIFETIME, BANDWIDTH, XOR-MAPPED-ADDRESS
(71.134.182.214:33298), and MESSAGE-INTEGRITY attributes. Packets 5 and 6 are
the Refresh Request and Refresh Success Response, both with lifetime 0.]


Figure 7-15 A TURN allocation success response. The message is integrity-protected and includes 
the XOR-RELAYED-ADDRESS attribute, identifying the port and address allocated by 
the TURN server. The allocation is deleted if not refreshed. 


7.4.5 Interactive Connectivity Establishment (ICE) 

Given the large variety of NATs deployed and the various mechanisms that may 
be necessary to traverse them, a generic facility called Interactive Connectivity 
Establishment (ICE) [RFC5245] has been developed to help UDP-based applications 
hosted behind a NAT establish connectivity. ICE is a set of heuristics by which
an application can perform UNSAF in a relatively predictable fashion. In its
operation, ICE makes use of other protocols such as TURN and STUN. A proposal
extends the use of ICE to TCP-based applications [IDTI].

ICE works with and extends "offer/answer" protocols, such as the Session
Description Protocol (SDP) used with unicast SIP connection establishment
[RFC3264]. These protocols involve an offer of service with an accompanying set
of service parameters followed by an answer that also includes a set of selected
options. It is increasingly common to find ICE clients incorporated into VoIP
applications that use SDP/SIP for establishing communications. However, in
such circumstances, ICE is used for establishing NAT traversal for media streams
(such as the audio or video portion of a call carried using RTP [RFC3550] or SRTP
[RFC3711]), while another mechanism, called SIP Outbound [RFC5626], handles
the SIP signaling information such as who is being called. Although in practice
ICE has been used primarily with SIP/SDP-based applications, it can also be used
as a generic NAT traversal mechanism for other applications. One such example
is the use of ICE (over UDP) with Jingle [XEP-0176], defined as an extension to
the core Extensible Messaging and Presence Protocol (XMPP) [RFC6120].

Ordinarily, ICE works to establish communication between two SDP entities
(called agents) by first determining a set of candidate transport addresses that
each agent might use for communicating with the other. Referring to Figure 7-11,
these addresses could be host transport, server-reflexive, or relayed addresses.
ICE may make use of both STUN and TURN to determine the candidate transport
addresses. ICE then orders these addresses according to a priority assignment
algorithm. The algorithm arranges for addresses that provide direct connectivity
to receive greater priority than those that require data relaying. ICE then
provides the set of prioritized addresses to its peer agent, which engages in a
similar behavior. Ultimately, two agents agree on the best set of usable address
pairs and indicate the selected results to the other peer. Determination of which
candidate transport addresses are available is accomplished using a sequence of
checks encoded as STUN messages. ICE has several optimizations to decrease the
latency of agreeing on the selected candidate, which are beyond the scope of this
discussion.

ICE begins by attempting to discover all available candidate addresses.
Addresses may be locally assigned transport addresses (multiple if the agent is
multihomed), server-reflexive addresses, or relayed addresses determined by
TURN. After assigning each address a priority, an agent sends the prioritized
list to its peer using SDP. The peer performs the same operation, resulting in
each agent having two prioritized lists. Each agent then forms an identical set
of prioritized candidate pairs by pairing up the two lists. A set of checks is
performed on the candidate pairs in a particular order to determine which
addresses will ultimately be selected. Generally, the priority ordering prefers
candidate pairs with fewer NATs or relays. The candidate pair ultimately selected
is determined by a controlling agent assigned by ICE. The controlling agent
nominates which valid candidate pairs are to be used, according to its order of
preference. The controlling agent may try all pairs and subsequently make its
choice (called regular nomination) or may use the first viable pair (called
aggressive nomination). A nomination is expressed as a flag in a STUN message
referring to a particular pair; aggressive nomination is performed by setting
the nominate flag in every request.
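
To make the ordering concrete, the following sketch computes candidate and
candidate-pair priorities. It assumes the formulas and recommended type
preference values given in [RFC5245]; the function names, the fixed local
preference of 65535, and the single component are simplifying assumptions made
for illustration.

```python
from itertools import product

# Type preferences recommended by [RFC5245]; higher values are preferred.
TYPE_PREF = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type, local_pref=65535, component_id=1):
    """priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)."""
    return (TYPE_PREF[cand_type] << 24) + (local_pref << 8) + (256 - component_id)


def pair_priority(controlling_prio, controlled_prio):
    """Pair priority from the controlling (G) and controlled (D) priorities."""
    g, d = controlling_prio, controlled_prio
    return (min(g, d) << 32) + 2 * max(g, d) + (1 if g > d else 0)


def ordered_pairs(local_types, remote_types, local_is_controlling=True):
    """Pair every local candidate with every remote one, best pair first."""
    pairs = []
    for loc, rem in product(local_types, remote_types):
        lp, rp = candidate_priority(loc), candidate_priority(rem)
        g, d = (lp, rp) if local_is_controlling else (rp, lp)
        pairs.append((pair_priority(g, d), loc, rem))
    return sorted(pairs, reverse=True)


if __name__ == "__main__":
    for prio, loc, rem in ordered_pairs(["host", "relay"], ["host", "srflx"]):
        print(f"{loc:5s} -> {rem:5s}  priority {prio}")
```

Because the pair formula is expressed in terms of the controlling and controlled
roles rather than "local" and "remote," both agents arrive at the same priority
for each pair, and pairs involving a relayed candidate sort last.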

Checks are sent as STUN binding request messages exchanged between the
two agents using the addressing information being checked. Checks are initiated
by timer, or scheduled as a result of an incoming check from a peer (called a
triggered check). Responses arrive in the form of STUN binding responses that
contain addressing information. In some circumstances this may reveal a new
server-reflexive address to the agent (e.g., because a different NAT is used
between agents from the one that was used when the candidate addresses were
first determined using STUN or TURN servers). Should this happen, the agent gains
a new address called a peer-reflexive candidate, which ICE adds to the set of
candidate addresses. ICE checks are integrity-checked using STUN's short-term
credential mechanism and use the STUN FINGERPRINT attribute. When TURN is used,
the ICE client uses TURN permissions to limit the TURN binding to the remote
candidate address of interest.

ICE incorporates the concept of different implementations. Lite
implementations are designed for deployment in systems that do not employ NAT.
They do not ever act as a controlling agent unless interacting with another Lite
implementation. They also do not perform the checks mentioned earlier as do full
implementations. The type of an ICE implementation is indicated in the STUN
messages it sends. All ICE implementations must comply with STUN [RFC5389], but
Lite implementations will only ever act as STUN servers. ICE extends STUN with
the attributes described in Table 7-4.


Table 7-4 STUN attributes defined by ICE 


Name              Value    Purpose/Use
PRIORITY          0x0024   Computed priority of associated candidate address
USE-CANDIDATE     0x0025   Indicates selection of candidate by controlling agent
ICE-CONTROLLED    0x8029   Indicates sender of message is controlled agent
ICE-CONTROLLING   0x802A   Indicates sender of message is controlling agent


A check is a STUN binding request containing the PRIORITY attribute. The
value is equal to the value assigned by the algorithm described in Section 4.1.2
of [RFC5245]. The ICE-CONTROLLING and ICE-CONTROLLED attributes are
included in STUN requests when the sender is the controlling or controlled agent,
respectively. A controlling agent may also include a USE-CANDIDATE attribute.
If present, this attribute indicates which candidate the controlling agent wishes
to select for subsequent use.
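
The attributes in Table 7-4 can be encoded with the usual STUN type-length-value
layout. The sketch below is illustrative only. It assumes the attribute formats
of [RFC5245] (a 32-bit PRIORITY value, an empty USE-CANDIDATE, and a 64-bit
tie-breaker value carried in ICE-CONTROLLING and ICE-CONTROLLED); the function
names are hypothetical, and the STUN header and the USERNAME,
MESSAGE-INTEGRITY, and FINGERPRINT attributes of a real check are omitted.

```python
import struct

PRIORITY, USE_CANDIDATE = 0x0024, 0x0025
ICE_CONTROLLED, ICE_CONTROLLING = 0x8029, 0x802A


def stun_attribute(attr_type: int, value: bytes = b"") -> bytes:
    """Encode one attribute: 16-bit type, 16-bit length, value padded to 4 bytes."""
    padding = b"\x00" * (-len(value) % 4)
    return struct.pack("!HH", attr_type, len(value)) + value + padding


def check_attributes(priority: int, tie_breaker: int,
                     controlling: bool, nominate: bool = False) -> bytes:
    """Build the ICE-specific attributes carried in a connectivity check."""
    role = ICE_CONTROLLING if controlling else ICE_CONTROLLED
    attrs = stun_attribute(PRIORITY, struct.pack("!I", priority))
    attrs += stun_attribute(role, struct.pack("!Q", tie_breaker))
    if controlling and nominate:
        # Only the controlling agent may nominate a candidate pair.
        attrs += stun_attribute(USE_CANDIDATE)
    return attrs


if __name__ == "__main__":
    print(check_attributes(2130706431, 1, controlling=True, nominate=True).hex())
```

With aggressive nomination, a controlling agent would set nominate=True on every
check; with regular nomination, only on the check for the pair it finally chooses.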


7.5 Configuring Packet-Filtering Firewalls and NATs

Although NATs frequently require little configuration (unless port forwarding is
being used), firewalls usually do, and sometimes they require extensive
configuration. In most home networks the same device is providing NAT, IP
routing, and firewall capabilities and may require some configuration. Although
the configuration is logically separate for each of these, they are sometimes
merged, either in configuration files, command-line interfaces, Web page
controls, or other network management tools.

7.5.1 Firewall Rules

A packet-filtering firewall must be given a set of instructions indicating
criteria for selecting traffic to be dropped or forwarded. Nowadays when
configuring a router, the network administrator usually configures a set of one
or more ACLs. Each ACL consists of a list of rules, and each rule typically
contains pattern-matching criteria and an action. The matching criteria generally
allow the rule to express the values of packet fields at either the network or
transport layer (e.g., source and destination IP addresses, port numbers, ICMP
type field, etc.) and a direction specification. The direction pattern matches
traffic in a direction-dependent manner and allows for a different set of rules
to apply for incoming versus outgoing traffic. Many firewalls also allow the
rules to be applied at a certain point in the order of processing within the
firewall. Examples of this include the ability to specify an ACL to be checked
prior to or after the IP routing decision process. In some circumstances
(especially when more than one interface is used), this flexibility becomes
important.

When a packet arrives, the matching criteria in the appropriate ACL are
consulted in order. For most firewalls, the first matching rule is acted upon.
Typical actions include a specification to block or forward the traffic and may
also adjust a counter or write a log entry. Some firewalls may include additional
features as well, such as having some packets directed to applications or other
hosts. Each firewall vendor usually has its own method for specifying rules,
although Cisco Systems' ACL format has emerged as a popular format supported by
many vendors of enterprise-class routers. ACLs for home users are typically
configured using a simple Web interface.
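
First-match processing is easy to model. The sketch below is a simplified
illustration and does not reflect any vendor's ACL syntax; the Rule fields, the
packet dictionary keys, and the example addresses are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Rule:
    action: str                   # "permit" or "deny"
    direction: str                # "in" or "out"
    src: str = "0.0.0.0/0"        # source prefix
    dst: str = "0.0.0.0/0"        # destination prefix
    proto: Optional[str] = None   # None matches any protocol
    dport: Optional[int] = None   # None matches any destination port

    def matches(self, pkt: dict) -> bool:
        return (
            self.direction == pkt["direction"]
            and ipaddress.ip_address(pkt["src"]) in ipaddress.ip_network(self.src)
            and ipaddress.ip_address(pkt["dst"]) in ipaddress.ip_network(self.dst)
            and self.proto in (None, pkt["proto"])
            and self.dport in (None, pkt.get("dport"))
        )


def evaluate(acl: List[Rule], pkt: dict,
             default: str = "deny") -> Tuple[str, Optional[int]]:
    """Return (action, index of the first matching rule) for a packet."""
    for index, rule in enumerate(acl):
        if rule.matches(pkt):
            return rule.action, index   # first match wins; later rules ignored
    return default, None


ACL = [
    Rule("permit", "in", dst="192.0.2.10/32", proto="tcp", dport=80),
    Rule("deny", "in"),
    Rule("permit", "out"),
]

if __name__ == "__main__":
    web = {"direction": "in", "src": "198.51.100.7", "dst": "192.0.2.10",
           "proto": "tcp", "dport": 80}
    ssh = dict(web, dport=22)
    print(evaluate(ACL, web))   # ('permit', 0)
    print(evaluate(ACL, ssh))   # ('deny', 1)
```

Rule order matters: if the general deny rule were placed first, the more
specific permit rule for the Web server would never be reached.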

One of the popular systems for building firewalls is included with modern
versions of Linux and is called iptables, built using a network filtering
capability called NetFilter [NFWEB]. It is the evolution of an earlier facility
called ipchains and provides stateless and stateful packet-filtering support as
well as NAT and NAPT. We shall examine how it works to get a better understanding
of the types of capabilities a firewall and modern NAT provide.

iptables includes the concepts of filter tables and filter chains. A table
contains several predefined chains and may contain zero or more user-defined
chains. Three predefined tables are named as follows: filter, nat, and mangle.
The default filter table is for basic packet filtering and contains the
predefined chains INPUT, FORWARD, and OUTPUT. These chains correspond,
respectively, to packets destined for programs running on the firewall router
itself, those passing through it while being routed, and those originating at the
firewall machine. The nat table contains the chains PREROUTING, OUTPUT, and
POSTROUTING. The mangle table has all five chains. It is used for arbitrary
rewriting of packets.

Each filter chain is a list of rules, and each rule has matching criteria and an
action. The action (called a target) may be to execute a special user-defined
chain or to perform one of the following predefined actions: ACCEPT, DROP, QUEUE,
and RETURN. A packet matching a rule with one of these targets is immediately
acted on. ACCEPT (DROP) means the packet is forwarded (dropped). QUEUE
means the packet is delivered to a user program for arbitrary processing, and
RETURN means that processing continues in a previously invoked chain, which
forms a sort of packet filter chain subroutine call.
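
The subroutine-like behavior of user-defined chains and the RETURN target can be
modeled as a small recursive function. This is a toy model written for
illustration, not how NetFilter is implemented; the chain name tcp_checks, the
packet fields, and the rule predicates are hypothetical, and the QUEUE target is
left out.

```python
def traverse(chains, policies, start, pkt):
    """Return "ACCEPT" or "DROP" for pkt, starting at a predefined chain."""

    def run(chain_name):
        for match, target in chains[chain_name]:
            if not match(pkt):
                continue
            if target in ("ACCEPT", "DROP"):
                return target            # final verdict, stop processing
            if target == "RETURN":
                return None              # resume in the calling chain
            verdict = run(target)        # jump to a user-defined chain
            if verdict is not None:
                return verdict
        return None                      # end of chain reached without verdict

    verdict = run(start)
    # If no rule produced a verdict, the chain's default policy applies.
    return verdict if verdict is not None else policies[start]


CHAINS = {
    "INPUT": [
        (lambda p: p["proto"] == "tcp", "tcp_checks"),
        (lambda p: True, "ACCEPT"),
    ],
    "tcp_checks": [
        (lambda p: p["flags"] == 0, "DROP"),
        (lambda p: True, "RETURN"),
    ],
}
POLICIES = {"INPUT": "DROP"}

if __name__ == "__main__":
    print(traverse(CHAINS, POLICIES, "INPUT", {"proto": "tcp", "flags": 0}))
    print(traverse(CHAINS, POLICIES, "INPUT", {"proto": "tcp", "flags": 0x10}))
    print(traverse(CHAINS, POLICIES, "INPUT", {"proto": "udp"}))
```

The second packet enters tcp_checks, hits RETURN, and continues with the next
rule in INPUT, just as a subroutine returns to its caller.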

The design of a complete firewall configuration can be fairly complex and is
specific to the needs of particular users and the types of services they require,
so we will not attempt to give one here. Instead, the following examples
illustrate only a small number of the possible uses for iptables. The following
gives an example Linux firewall configuration file. It is invoked by a shell such
as bash:


EXTIF="ext0"
INTIF="eth0"
LOOPBACK_INTERFACE="lo"
ALL="0.0.0.0/0" # matches all

# set default filter table policies to drop
iptables -P INPUT DROP
iptables -P OUTPUT DROP
iptables -P FORWARD DROP

# all local traffic OK
# (the OUTPUT chain matches on the outgoing interface, so -o is used)
iptables -A INPUT -i $LOOPBACK_INTERFACE -j ACCEPT
iptables -A OUTPUT -o $LOOPBACK_INTERFACE -j ACCEPT

# accept incoming DHCP requests on internal interface
# (clients send from port 68 to the server port 67)
iptables -A INPUT -i $INTIF -p udp -s 0.0.0.0 \
    --sport 68 -d 255.255.255.255 --dport 67 -j ACCEPT

# drop unusual/suspect TCP traffic with no flags set
iptables -A INPUT -p tcp --tcp-flags ALL NONE -j DROP


This example illustrates some of the flexibility one can employ in setting up a
filter criteria list. Initially, the chains are given a default policy (-P
option), which affects packets that fail to match any rules. Next, traffic to or
from the local computer (which is delivered using the pseudo interface lo) is
given to the ACCEPT target (i.e., it is allowed) for the INPUT and OUTPUT chains
in the default filter table. The -j option indicates "jump" to a particular
processing target. Next, incoming UDP broadcast traffic originating from IPv4
address 0.0.0.0 and destined for local/subnet broadcast using the DHCP port
numbers (67, 68) is allowed in via the internal interface. Next, the Flags field
of incoming TCP segments (see Chapter 13) is ANDed with all 1s (ALL) and compared
against zero (NONE). A match occurs only if all the Flags field bits are 0,
which is not a very useful TCP segment (ordinarily all TCP segments after the
first one contain a valid ACK bit, and the first one contains a SYN).
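
The AND-and-compare test applied to the TCP flags can be written out directly.
In this sketch the first argument after the flags is the set of bits to examine
and the second is the set that must be on among them, which is our reading of
the two arguments to the --tcp-flags option. The function name is hypothetical;
the bit values are the standard TCP header flag positions.

```python
FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
         "PSH": 0x08, "ACK": 0x10, "URG": 0x20}
ALL = sum(FLAGS.values())   # every flag bit examined
NONE = 0                    # no flag bit may be set


def tcp_flags_match(segment_flags: int, mask: int, comp: int) -> bool:
    """True if, among the bits in mask, exactly those in comp are set."""
    return (segment_flags & mask) == comp


if __name__ == "__main__":
    syn = FLAGS["SYN"]
    syn_ack = FLAGS["SYN"] | FLAGS["ACK"]
    print(tcp_flags_match(0, ALL, NONE))        # True: a "null" segment
    print(tcp_flags_match(syn, ALL, NONE))      # False: ordinary first segment
    print(tcp_flags_match(syn_ack, ALL, NONE))  # False: ordinary reply
```

The same function covers other common tests; for example, examining SYN, ACK,
FIN, and RST and requiring only SYN selects connection-opening segments.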

While the syntax illustrated by this example is specific to the iptables
facility, its capabilities are not. Most filtering firewalls are capable of
performing similar types of checks and actions.

7.5.2 NAT Rules

In most simple routers, NAT can be configured in conjunction with firewall rules. 
In basic Windows systems, NAT is called Internet Connection Sharing (ICS), and in 
Linux it is called IP masquerading. On Windows XP, for example, ICS has a number 
of special characteristics. It assigns the "internal" IPv4 address as 192.168.0.1 to the 
machine running ICS and starts a DHCP server and DNS server. Other computers 
are assigned addresses in the 192.168.0/24 subnet, with the ICS machine as DNS 
server. Therefore, ICS should not be enabled on networks where these services are 
already being provided by another computer or router, or where the addresses 
might conflict. A registry setting can be used to change the default address range. 

Enabling ICS for an Internet connection on Windows XP can be accomplished
by using the Network Setup Wizard, or by changing the Advanced properties on
an already-operating Internet connection (under Settings | Network Connections).
At this point, the user may also decide to allow other users to control or
disable the shared Internet connection. This facility, known as Internet Gateway
Device Discovery and Control (IGDDC), uses the Universal Plug and Play framework,
described in Section 7.5.3, for controlling a local Internet gateway from a
client. The functions supported include connect and disconnect, along with
reading various status messages. The Windows firewall facility, which works in
conjunction with ICS, supports the creation of service definitions. Service
definitions are equivalent to port forwarding, as defined previously. To enable
it, the Advanced property tab on the Internet connection is selected and a new
service may be added (or an existing one edited). The user is then given the
opportunity to fill in the appropriate TCP and UDP port numbers, both at the
external interface and at the internal server machine. It thus works as a way to
configure NAPT for incoming connections.

As with Windows, Linux combines the masquerade capability with its firewall
implementation. The following script configures masquerading in a simple
manner. Note that this script is only for illustration and is not recommended for
production use.

EXTIF="ext0"

echo "Default FORWARD policy: DROP"
iptables -P FORWARD DROP

echo "Enabling NAT on $EXTIF for hosts 192.168.0.0/24"
iptables -t nat -A POSTROUTING -o $EXTIF -s 192.168.0.0/24 \
    -j MASQUERADE

echo "FORWARD policy: DROP unknown traffic"
iptables -A INPUT -i $EXTIF -m state --state NEW,INVALID -j DROP
iptables -A FORWARD -i $EXTIF -m state --state NEW,INVALID -j DROP


Here, the default policy for the FORWARD chain in the filter table is set to
DROP. The next item arranges for hosts with IPv4 addresses assigned from the
192.168.0.0/24 subnet to have their addresses rewritten for any IPv4 traffic (via
NAT, implemented by the nat table and -t nat options) after routing has
determined the external interface to be the appropriate one. Because of the
stateful way that NAT works, it is now possible to adjust the filter table's
rules to allow only traffic associated with a connection known to NAT. The last
two lines adjust the INPUT and FORWARD chains so that any incoming traffic that
is either invalid or unknown (NEW) is dropped. The special operators NEW and
INVALID are defined within the iptables command.
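
The stateful behavior relied upon here can be pictured with a toy NAPT table.
The sketch is a deliberate simplification and is not how the Linux connection
tracker is implemented: the class and method names are hypothetical, mappings
never time out, and the remote endpoint is not recorded.

```python
class Napt:
    """Toy NAPT: outbound traffic creates state, inbound traffic needs it."""

    def __init__(self, external_ip, first_port=40000):
        self.external_ip = external_ip
        self.next_port = first_port
        self.out_map = {}   # (proto, internal ip, internal port) -> external port
        self.in_map = {}    # (proto, external port) -> (internal ip, internal port)

    def outbound(self, proto, src_ip, src_port):
        """Rewrite an outgoing packet's source, creating a mapping if needed."""
        key = (proto, src_ip, src_port)
        if key not in self.out_map:
            self.out_map[key] = self.next_port
            self.in_map[(proto, self.next_port)] = (src_ip, src_port)
            self.next_port += 1
        return self.external_ip, self.out_map[key]

    def inbound(self, proto, dst_port):
        """Return the internal destination, or None if no state exists (drop)."""
        return self.in_map.get((proto, dst_port))


if __name__ == "__main__":
    nat = Napt("203.0.113.1")
    print(nat.outbound("tcp", "192.168.0.5", 51000))  # ('203.0.113.1', 40000)
    print(nat.inbound("tcp", 40000))                  # ('192.168.0.5', 51000)
    print(nat.inbound("tcp", 40001))                  # None
```

Inbound traffic for which inbound() returns None corresponds to the unknown
(NEW) traffic that the last two rules of the script drop.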

7.5.3 Direct Interaction with NATs and Firewalls: UPnP, NAT-PMP, and PCP 

In many cases, a client system wishes to or needs to interact directly with its
firewall. For example, a firewall may need to be configured or reconfigured for
different services by allowing traffic destined for a particular port to not be
dropped (establishing a "pinhole"). In cases where a proxy firewall is in use,
each client must be informed of the proxy's identity. Otherwise, communication
beyond the firewall is not possible. A number of protocols have been developed
for supporting communication between clients and firewalls. The two most
prevalent ones are called Universal Plug and Play (UPnP) and the NAT Port Mapping
Protocol (NAT-PMP). The standards for UPnP are developed by an industry group
called the UPnP Forum [UPNP]. NAT-PMP is currently an expired draft document
within the IETF [XIDPMP]. NAT-PMP is supported by most Mac OS X systems. UPnP
has native support on Windows systems and can be added to Mac OS and Linux
systems. UPnP is also used in support of consumer electronics device discovery
protocols for home networks being developed by the Digital Living Network
Alliance (DLNA) [DLNA].

With UPnP, controlled devices are configured with IP addresses based first 
upon DHCP and using dynamic link-local address configuration (see Chapter 6) 
if DHCP is not available. Next, the Simple Service Discovery Protocol (SSDP) [XIDS] 
announces the presence of the device to control points (e.g., client computers) and 
allows the control points to query the devices for additional information. SSDP 
uses two variants of HTTP with UDP instead of the more standard TCP. They are 
called HTTPU and HTTPMU [XIDMU], and the latter uses multicast addressing 
(IPv4 address 239.255.255.250, port 1900). For SSDP carried on IPv6, the following 
addresses are used: ff01::c (node-local), ff02::c (link-local), ff05::c (site-local), ff08::c 
(organization-local), and ff0e::c (global). 
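
A control point's discovery query can be sketched as follows. The multicast
address and port are those given above. The M-SEARCH request line, the header
names, and the "ssdp:all" search target are assumptions taken from the UPnP
device architecture rather than from this chapter, and the function names are
hypothetical.

```python
import socket

SSDP_ADDR, SSDP_PORT = "239.255.255.250", 1900


def build_msearch(search_target: str = "ssdp:all", mx: int = 2) -> bytes:
    """Build an SSDP discovery request (HTTP-style text carried in UDP)."""
    lines = [
        "M-SEARCH * HTTP/1.1",
        f"HOST: {SSDP_ADDR}:{SSDP_PORT}",
        'MAN: "ssdp:discover"',
        f"MX: {mx}",                 # maximum seconds a device may wait to reply
        f"ST: {search_target}",      # which devices or services should answer
        "", "",                      # blank line terminates the header block
    ]
    return "\r\n".join(lines).encode("ascii")


def discover(timeout: float = 3.0):
    """Send one query and return (sender address, reply text) tuples."""
    replies = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_msearch(), (SSDP_ADDR, SSDP_PORT))
        try:
            while True:
                data, sender = sock.recvfrom(4096)
                replies.append((sender, data.decode("ascii", "replace")))
        except socket.timeout:
            pass                     # no more replies within the timeout
    return replies
```

Calling discover() on a home network typically returns one reply per responding
device; each reply carries a URL from which the device description can be
fetched.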

Subsequent control and event notification ("eventing") is controlled by the
General Event Notification Architecture (GENA), which uses the Simple Object
Access Protocol (SOAP). SOAP supports a client/server remote procedure call (RPC)
mechanism and uses messages encoded in the Extensible Markup Language (XML),
which is commonly used for Web pages. UPnP is used for a wide variety of consumer
electronic devices, including audio and video playback and storage devices.
NAT/firewall devices are controlled using the Internet Gateway Device (IGD)
protocol [IGD]. IGD supports a variety of capabilities, including the ability to
learn NAT mappings and configure port forwarding. The interested reader may
obtain a simple IGD client useful for experimentation from the MiniUPnP Project
HomePage [UPNPC]. A second version of UPnP IGD [IGD2] adds general IPv6 support
to UPnP.

While UPnP is a broad framework that includes NAT control and several other
unrelated specifications, NAT-PMP provides an alternative specifically targeted
at programmatic communications with NAT devices. NAT-PMP is part of Apple's set
of Bonjour specifications for zero configuration networking. NAT-PMP does not
use a discovery process, as the device being managed is usually a system's
default gateway as learned by DHCP. NAT-PMP supports a simple request/response
protocol for learning a NAT's outside address and configuring port mappings. It
also supports a basic eventing mechanism that notifies listeners when a NAT
outside address changes. This is accomplished using a UDP multicast message sent
to address 224.0.0.1 (the All Hosts address) when the outside address changes.
NAT-PMP uses UDP port 5351 for client/server interactions and 5350 for multicast
event notification. The idea of NAT-PMP can be extended for use with SPNAT, as
proposed by the Port Control Protocol (PCP) [IDPCP].
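
The simplicity of the request/response protocol shows in its message formats.
The sketch below builds the two request types and parses an outside-address
response. The field layouts (a version of 0, opcode 0 for an address request,
opcodes 1 and 2 for UDP and TCP mappings, and a 12-byte address response) are
assumptions taken from the NAT-PMP specification rather than from this chapter,
and the function names are hypothetical. Nothing is sent on the network.

```python
import ipaddress
import struct


def external_address_request() -> bytes:
    """Two-byte request: version 0, opcode 0."""
    return struct.pack("!BB", 0, 0)


def mapping_request(protocol: str, internal_port: int,
                    suggested_external_port: int, lifetime: int) -> bytes:
    """Twelve-byte request asking the NAT to create a port mapping."""
    opcode = {"udp": 1, "tcp": 2}[protocol]
    return struct.pack("!BBHHHI", 0, opcode, 0,
                       internal_port, suggested_external_port, lifetime)


def parse_external_address_response(data: bytes):
    """Return (outside address, seconds since NAT start) from a response."""
    version, opcode, result, epoch, addr = struct.unpack("!BBHI4s", data)
    if version != 0 or opcode != 128 or result != 0:
        raise ValueError("not a successful external address response")
    return str(ipaddress.IPv4Address(addr)), epoch


if __name__ == "__main__":
    print(external_address_request().hex())
    print(mapping_request("tcp", 8080, 8080, 3600).hex())
```

A lifetime of 0 in a mapping request is conventionally used to delete a
mapping, much as a TURN refresh with lifetime 0 removes an allocation.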


7.6 NAT for IPv4/IPv6 Coexistence and Transition 

With the depletion of the last top-level unicast IPv4 address prefixes in early
2011, the embracing of IPv6 is beginning to accelerate. It was thought that hosts
could be equipped with dual-stack functionality (i.e., each implements a complete
IPv4 and IPv6 stack) [RFC4213] and network services would transition over to
IPv6-only operation. It is now understood that IPv4 and IPv6 are likely to
coexist for an extended period of time, perhaps indefinitely, and that for
various economic reasons network infrastructure may operate using either IPv4 or
IPv6 or both. Assuming that this is true, there will be an ongoing need to
support communications between IPv4 and IPv6 systems, whether they are dual-stack
or not. The two major approaches that have been used to support combinations of
IPv4 and IPv6 are tunneling and translation. The tunneling approaches include
Teredo (see Chapter 10), Dual-Stack Lite (DS-Lite), and IPv6 Rapid Deployment
(6rd). Although DS-Lite involves SPNAT as part of its architecture, a purer
translation approach is given by the framework described in [RFC6144], which uses
the IPv4-embedded IPv6 addresses we saw in Chapter 2. We will discuss both
DS-Lite and the translation framework in more detail in this section.

7.6.1 Dual-Stack Lite (DS-Lite) 

DS-Lite [RFC6333] is an approach to make transition to IPv6 (and support for
legacy IPv4 users) easier for service providers that wish to run IPv6 internally.
In essence, it allows providers to focus on deploying an operational IPv6 core
network yet provide IPv4 and IPv6 connectivity to their customers using a small
number of IPv4 addresses. The approach combines IPv4-in-IPv6 "softwire" tunneling
[RFC5571] with SPNAT. Figure 7-16 shows the type of deployment envisioned.

Figure 7-16 DS-Lite allows service providers to support IPv4 and IPv6 customer networks using
an IPv6-only infrastructure. IPv4 address usage is minimized by using SPNAT at the
provider's edge.

In Figure 7-16, each customer network operates with any combination of IPv4
and IPv6. The service provider's network is assumed to be managed using only
IPv6. Customer access to the IPv6 Internet is provided using conventional IPv6
routing. For IPv4 access, each customer uses a special "before" gateway (labeled
"B4" in Figure 7-16). A B4 element provides basic IPv4 services (e.g., DHCP
service, a DNS proxy, etc.) but also encapsulates the customer's IPv4 traffic in
multi-point-to-point tunnels terminated at the "after" element (labeled "AFTR" in
Figure 7-16). The AFTR element performs decapsulation of traffic headed to the
IPv4 Internet and encapsulation in the reverse direction. AFTR also performs NAT
and acts as a form of SPNAT. More specifically, the AFTR may use the identity of
the customer's tunnel endpoint for disambiguating traffic returning to the AFTR
from the IPv4 Internet. This allows multiple customers to use the same IPv4
address space. A B4 element can learn the name of its corresponding AFTR element
using a DHCPv6 option called AFTR-Name [RFC6334].

It is instructive to recall the discussion of IPv6 rapid deployment (6rd) from
Chapter 6. Whereas DS-Lite provides IPv4 access to customers over a service
provider's IPv6 network, 6rd aims to provide IPv6 access to customers over a
service provider's IPv4 network. In essence, they take opposite approaches with
similar architectural components. However, with 6rd, mapping from an IPv6 address
to the address of the corresponding IPv4 tunnel endpoint (and vice versa) is
computed in a stateless fashion using an address-mapping algorithm. Stateless
address translation is also used in the framework for full protocol translation
between IPv4 and IPv6, which we discuss next.

7.6.2 IPv4/IPv6 Translation Using NATs and ALGs 

The biggest disadvantage of using tunneling techniques for supporting IPv4/IPv6
coexistence is that network services running on hosts using one address family
cannot be reached directly by the hosts using the other. Thus, an IPv6-only host
can communicate only with other IPv6-capable systems. This is an undesirable
situation because many valuable services offered on the legacy IPv4 Internet
would remain unavailable to new systems that may support only IPv6. To address
this concern, a significant effort was undertaken between 2008 and 2010 to
develop a framework to provide direct translation between IPv4 and IPv6. This
effort was informed by poor experiences with NAT-PT [RFC2766], which was
ultimately determined to be too brittle and unscalable for ongoing use and was
deprecated [RFC4966].

The IPv4/IPv6 translation framework is given in [RFC6144]. The basic translation 
architecture involves both stateful and stateless methods to convert between 
IPv4 and IPv6 addresses, translations for DNS (see Chapter 11), and the definition 
of any additional behaviors or ALGs in cases where they are necessary (including 
for ICMP and FTP). In this section, we will discuss the basics of the stateless and 
stateful address translation for IP based on [RFC6145], [RFC6146], and the addressing 
from [RFC6052] we discussed in Chapter 2. Other protocol-specific translation 
issues will be covered in subsequent chapters. 

7.6.2.1 IPv4-Converted and IPv4-Translatable Addresses 

In Chapter 2, we discussed the structure of IPv4-embedded IPv6 addresses. Such 
addresses are IPv6 addresses that can be used as input to a function that produces 
a corresponding IPv4 address. The function is also easily inverted. There are two 
important types of IPv4-embedded IPv6 addresses, called IPv4-converted addresses 
and IPv4-translatable addresses. Each type of address mentioned is a subset of 
the other types. That is, if we treat each address category as a set, then 
(IPv4-translatable) ⊂ (IPv4-converted) ⊂ (IPv4-embedded) ⊂ (IPv6). IPv4-translatable 
addresses are IPv6 addresses for which an IPv4 address can be determined in a 
stateless fashion (see Section 7.6.2.2). 

Algorithmic translation between IPv4 and IPv6 addresses involves the use of 
a prefix, as described in Chapter 2. The prefix may be either the Well-Known Prefix 
(WKP) 64:ff9b::/96 or another Network-Specific Prefix that is ordinarily owned 
by a service provider and used specifically with its translators. The WKP is used 
only in representing ordinary globally routable IPv4 addresses; private addresses 
[RFC1918] are not to be used with the WKP. In addition, the WKP is not to be 
used for creating IPv4-translatable addresses. Such addresses are intended to be 
defined within the scope of a provider's network, so it is not appropriate to use 
them at a global scope. 

The WKP is interesting because it is checksum-neutral with respect to the Internet 
checksum. Recall the Internet checksum calculation from Chapter 5. If we treat 
the prefix 64:ff9b::/96 as being composed of the hexadecimal values 0064, ff9b, 
0000, 0000, 0000, 0000, 0000, 0000, the sum of these values is ffff, which is equal 
to 0 in one's complement. Consequently, when an IPv4 address has the WKP prepended, 
the associated Internet checksums in packets created as a result of translation 
(e.g., in the IPv4 header, TCP, or UDP checksum) are unaffected. Naturally, 
an appropriately chosen Network-Specific Prefix can also be checksum-neutral. 
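
The arithmetic behind checksum neutrality can be checked directly. The following 
Python sketch is illustrative only: the helper name ones_complement_sum16 and the 
sample address 192.0.2.33 are assumptions chosen for this example, not part of any 
specification. 

```python
import ipaddress


def ones_complement_sum16(data: bytes) -> int:
    """Add the 16-bit big-endian words of data using end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


WKP = ipaddress.IPv6Network("64:ff9b::/96")

# The prefix words 0064 + ff9b (plus zeros) sum to ffff, i.e., one's complement 0.
prefix_sum = ones_complement_sum16(WKP.network_address.packed)

# Prepending the WKP therefore leaves the sum over the address unchanged.
a4 = ipaddress.IPv4Address("192.0.2.33")  # sample address (assumption)
a6 = ipaddress.IPv6Address(int(WKP.network_address) | int(a4))
sum4 = ones_complement_sum16(a4.packed)
sum6 = ones_complement_sum16(a6.packed)

print(hex(prefix_sum), hex(sum4), hex(sum6))
```

Because sum4 and sum6 are equal, a checksum that covers the address (such as the 
pseudo-header portion of a TCP or UDP checksum) does not need to be adjusted for 
the prefix itself. 
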

In the following two subsections, we will use the notation To4(A6, P) to represent 
the IPv4 address derived from IPv6 address A6 in conjunction with prefix 
P. P is either the WKP or some Network-Specific Prefix. We will use the notation 
To6(A4, P) to represent the IPv6 address derived from IPv4 address A4 in conjunction 
with prefix P. Note that, with a few special exceptions, A6 = To6(To4(A6,P),P) 
and A4 = To4(To6(A4,P),P). 
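
As a concrete illustration of this notation, the sketch below implements To6 and 
To4 in Python for the simplest case only, a 96-bit prefix, where the IPv4 address 
occupies the low-order 32 bits. The function names to6 and to4 are ours, and the 
shorter prefix lengths permitted by [RFC6052] are deliberately not handled. 

```python
import ipaddress

WKP = ipaddress.IPv6Network("64:ff9b::/96")


def to6(a4, prefix=WKP):
    """To6(A4, P): embed an IPv4 address in the low 32 bits of a /96 prefix."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch handles /96 prefixes only")
    a4 = ipaddress.IPv4Address(a4)
    return ipaddress.IPv6Address(int(prefix.network_address) | int(a4))


def to4(a6, prefix=WKP):
    """To4(A6, P): extract the embedded IPv4 address from an address under P."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch handles /96 prefixes only")
    a6 = ipaddress.IPv6Address(a6)
    if a6 not in prefix:
        raise ValueError("address is not covered by the prefix")
    return ipaddress.IPv4Address(int(a6) & 0xFFFFFFFF)


a4 = ipaddress.IPv4Address("192.0.2.33")
a6 = to6(a4)
print(a6, to4(a6))
```

For addresses covered by the prefix, the two functions are inverses of each other, 
which is the round-trip property stated above. 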

7.6.2.2 Stateless Translation 

Stateless IP/ICMP Translation (SIIT) refers to a method of translating between IPv4 
and IPv6 packets without using state tables [RFC6145]. The translation is performed 
without table lookups and uses IPv4-translatable addresses along with 
a defined scheme to translate IP headers. For the most part, IPv4 options are not 
translated (they are ignored), nor are IPv6 extension headers (except the Fragment 
header). The exception is an unexpired IPv4 Source Route option. If such an 
option is present, the packet is dropped and a corresponding ICMP error message 
(Destination Unreachable, Source Route Failed; see Chapter 8) is generated. Table 
7-5 describes how the IPv6 header fields are assigned when translating an IPv4 
datagram to IPv6. 


Table 7-5 Methods for creating an IPv6 header when translating IPv4 to IPv6 

IPv6 Field              Assignment Method
----------------------  ------------------------------------------------------------
Version                 Set to 6.
DS Field/ECN            Copied from same values in IPv4 header.
Flow Label              Set to 0.
Payload Length          Set to IPv4 Total Length minus length of the IPv4 header
                        (including options).
Next Header             Set to IPv4 Protocol field (or 58 if the Protocol field had
                        value 1). Set to value 44 to indicate a Fragment header if
                        the IPv6 datagram being created is a fragment or DF bit not
                        set.
Hop Limit               Set to the IPv4 TTL field minus 1 (if this value is 0, the
                        packet is discarded and an ICMP Time Exceeded message is
                        generated; see Chapter 8).
Source IP Address       Set to To6(IPv4 Source IP Address, P).
Destination IP Address  Set to To6(IPv4 Destination IP Address, P).
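
The field assignments in Table 7-5 can be sketched in a few lines of Python. This 
is a simplified model, not a translator: headers are represented as dictionaries 
with field names of our own choosing, the Network-Specific Prefix 2001:db8:64::/96 
is a made-up documentation value, and no ICMP error is actually generated. The 
extra 8 bytes added to the Payload Length when a Fragment header is inserted 
follow [RFC6145] and mirror the "minus 8" used in the reverse direction. 

```python
import ipaddress

# Hypothetical Network-Specific Prefix, used only for this example.
PREFIX = ipaddress.IPv6Network("2001:db8:64::/96")


def to6(a4, prefix=PREFIX):
    """To6(A4, P) for a /96 prefix: IPv4 address in the low-order 32 bits."""
    return ipaddress.IPv6Address(
        int(prefix.network_address) | int(ipaddress.IPv4Address(a4)))


def ipv4_to_ipv6_header(v4, prefix=PREFIX):
    """Return the IPv6 header fields for an IPv4 header, or None to discard."""
    hop_limit = v4["ttl"] - 1
    if hop_limit <= 0:
        return None  # a real translator would also send ICMP Time Exceeded

    is_fragment = v4["mf"] or v4["frag_offset"] != 0
    add_fragment_header = is_fragment or not v4["df"]

    upper = 58 if v4["protocol"] == 1 else v4["protocol"]  # ICMPv4 -> ICMPv6
    payload_length = v4["total_length"] - 4 * v4["ihl"]
    if add_fragment_header:
        payload_length += 8  # the Fragment header counts as IPv6 payload

    return {
        "version": 6,
        "traffic_class": v4["tos"],  # DS Field/ECN copied unchanged
        "flow_label": 0,
        "payload_length": payload_length,
        "next_header": 44 if add_fragment_header else upper,
        "hop_limit": hop_limit,
        "src": to6(v4["src"], prefix),
        "dst": to6(v4["dst"], prefix),
    }


example = {"tos": 0, "total_length": 60, "ihl": 5, "protocol": 6, "ttl": 64,
           "df": True, "mf": False, "frag_offset": 0,
           "src": "192.0.2.1", "dst": "198.51.100.7"}
translated = ipv4_to_ipv6_header(example)
```

Clearing the DF bit in the example changes the Next Header value to 44 and grows 
the Payload Length by 8, matching the Fragment header rule in the table. 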


During the translation process, the IPv4 header is stripped and replaced with 
an IPv6 header. If the arriving IPv4 datagram is too large to fit in the MTU for the 
next link and the DF bit field in its header is not set, multiple IPv6 fragment packets 
may be produced, each containing a Fragment header. This also occurs when the 
arriving IPv4 datagram is a fragment. [RFC6145] recommends a Fragment header 
be included in the resulting IPv6 datagram whenever the arriving IPv4 datagram's 
DF bit field has value zero, whether or not the translator needs to perform 
fragmentation or the arriving datagram is a fragment. This allows the IPv6 receiver to 
know that the IPv4 sender was likely not using PMTUD. When a Fragment header 
is included, its fields are set according to the methods listed in Table 7-6. 


Table 7-6 Methods for assigning fields of the Fragment header, if used, during IPv4-to-IPv6 
translation 

Fragment Header Field  Assignment Method
---------------------  ------------------------------------------------------------
Next Header            Set to the IPv4 Protocol field.
Fragment Offset        Copied from the IPv4 Fragment Offset field.
More Fragments Bit     Copied from the IPv4 More Fragments (M) bit field.
Identification         The low-order 16 bits are set from the IPv4 Identification
                       field. The high-order 16 bits are set to 0.


The reverse direction (IPv6-to-IPv4 translation) involves creating an IPv4 
datagram with header field values based on fields in the arriving IPv6 header. 
Obviously the much larger IPv6 address space does not allow an IPv4-only host to 
access every host on the IPv6 Internet. Table 7-7 gives the methods used to assign 
the fields in the outgoing IPv4 datagram's header when an unfragmented IPv6 
datagram arrives. 


Table 7-7 Methods for creating an IPv4 header when translating unfragmented IPv6 to IPv4 

IPv4 Header Field       Assignment Method
----------------------  ------------------------------------------------------------
Version                 Set to 4.
IHL                     Set to 5 (no IPv4 options).
DS Field/ECN            Copied from same values in IPv6 header.
Total Length            The value of the IPv6 Payload Length field plus 20.
Identification          Set to 0 (with option to set to some other predetermined
                        value).
Flags                   More Fragments (M) is set to 0. Don't Fragment (DF) is set
                        to 1.
Fragment Offset         Set to 0.
TTL                     The value of the IPv6 Hop Limit field minus 1 (must be at
                        least 1).
Protocol                Copied from the first IPv6 Next Header field that does not
                        refer to a Fragment header, HOPOPT, IPv6-Route, or
                        IPv6-Opts. Value 58 is changed to 1 to support ICMP (see
                        Chapter 8).
Header Checksum         Computed for the newly created IPv4 header.
Source IP Address       To4(IPv6 Source IP Address, P).
Destination IP Address  To4(IPv6 Destination IP Address, P).
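
To make Table 7-7 concrete, the following Python sketch builds the 20-byte IPv4 
header for an unfragmented IPv6 datagram. It is a minimal model under stated 
assumptions: the IPv6 header is a dictionary with field names of our own choosing, 
no extension headers are present, and the prefix 2001:db8:64::/96 is a made-up 
documentation value. 

```python
import ipaddress
import struct

# Hypothetical Network-Specific Prefix, used only for this example.
PREFIX = ipaddress.IPv6Network("2001:db8:64::/96")


def internet_checksum(data: bytes) -> int:
    """Internet checksum: complement of the one's complement 16-bit sum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def to4(a6, prefix=PREFIX):
    """To4(A6, P) for a /96 prefix: take the low-order 32 bits."""
    a6 = ipaddress.IPv6Address(a6)
    if a6 not in prefix:
        raise ValueError("address is not covered by the prefix")
    return ipaddress.IPv4Address(int(a6) & 0xFFFFFFFF)


def ipv6_to_ipv4_header(v6, prefix=PREFIX):
    """Return the translated 20-byte IPv4 header, or None to discard."""
    ttl = v6["hop_limit"] - 1
    if ttl < 1:
        return None  # TTL must be at least 1
    protocol = 1 if v6["next_header"] == 58 else v6["next_header"]
    header = struct.pack(
        "!BBHHHBBH4s4s",
        (4 << 4) | 5,                # Version 4, IHL 5 (no options)
        v6["traffic_class"],         # DS Field/ECN copied unchanged
        v6["payload_length"] + 20,   # Total Length
        0,                           # Identification
        0x4000,                      # DF = 1, M = 0, Fragment Offset = 0
        ttl,
        protocol,
        0,                           # checksum placeholder
        to4(v6["src"], prefix).packed,
        to4(v6["dst"], prefix).packed,
    )
    checksum = internet_checksum(header)
    return header[:10] + struct.pack("!H", checksum) + header[12:]


example = {"traffic_class": 0, "payload_length": 40, "next_header": 58,
           "hop_limit": 64, "src": "2001:db8:64::c000:201",
           "dst": "2001:db8:64::c633:6407"}
header = ipv6_to_ipv4_header(example)
```

Recomputing the checksum over the finished header yields 0, which is how a 
receiver verifies it. 
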

If the arriving IPv6 datagram includes a Fragment header, the outgoing IPv4 
datagram uses field values based on assignment methods modified from those in 
Table 7-7. Table 7-8 gives this case. 


Table 7-8 Methods for creating an IPv4 header when translating fragmented IPv6 to IPv4 

IPv4 Header Field  Assignment Method
-----------------  -----------------------------------------------------------------
Total Length       The value of the IPv6 Payload Length field minus 8 plus 20.
Identification     Copied from the low-order 16 bits in the Identification field of
                   the IPv6 Fragment header.
Flags              More Fragments (M) copied from the M bit field in the IPv6
                   Fragment header. Don't Fragment (DF) is set to 0 to allow
                   fragmentation in the IPv4 network.
Fragment Offset    Copied from the Fragment Offset field of the IPv6 Fragment header.


In the case of fragmented IPv6 datagrams, the translator produces fragmented 
IPv4 datagrams. Note that in IPv6 the Identification field is larger, so there is a 
possibility that certain fragments could fail to be reassembled properly if multiple 
distinct IPv6 datagrams from the same host are fragmented in such a way that the 
Identification field values they use share a common lower-order 16 bits. However, 
this situation is no more risky than having the conventional IPv4 Identification 
field wrap. Furthermore, integrity checks at higher layers make this issue nothing 
much to worry about. 
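
The truncation involved is easy to demonstrate. In the following Python sketch, 
the function names and the two sample Identification values are ours, chosen so 
that the values differ only in their high-order 16 bits. 

```python
def v6_ident_to_v4(ident32: int) -> int:
    """IPv6-to-IPv4: keep only the low-order 16 bits of the Identification."""
    return ident32 & 0xFFFF


def v4_ident_to_v6(ident16: int) -> int:
    """IPv4-to-IPv6: low-order 16 bits copied, high-order 16 bits set to 0."""
    return ident16 & 0xFFFF


# Two distinct IPv6 Identification values that share their low-order 16 bits.
first, second = 0x0001ABCD, 0x0002ABCD
collide = v6_ident_to_v4(first) == v6_ident_to_v4(second)
print(collide)
```

Both datagrams end up with IPv4 Identification 0xabcd, so their fragments are 
distinguishable only by the other fields used for reassembly. 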

7.6.2.3 Stateful Translation 

In stateful translation, NAT64 [RFC6146] is used to support IPv6-only clients 
communicating with IPv4 servers. This is expected to be important during the period 
when many important services continue to be offered using only IPv4. The translation 
method for headers is nearly identical to the methods described for stateless 
translation in Section 7.6.2.2. As a NAT, NAT64 complies with BEHAVE specifications 
and supports only endpoint-independent mappings, along with both 
endpoint-independent and address-dependent filtering. Thus, it is compatible with 
the NAT traversal techniques (e.g., ICE, STUN, TURN) we discussed previously. 
Lacking these additional protocols, NAT64 supports dynamic translation only for 
IPv6 hosts initiating communications with IPv4 hosts. 

NAT64 works much like conventional NAT (NAPT) across address families, 
except translations in the IPv4-to-IPv6 direction are simpler than in the reverse 
direction. A NAT64 device is assigned an IPv6 prefix, which can be used to form a 
valid IPv6 address directly from an IPv4 address using the mechanism described 
in Chapter 2 and [RFC6052]. Because of the comparative scarcity of the IPv4 
address space, translations in the IPv6-to-IPv4 direction make use of a pool of 
IPv4 addresses that are ordinarily managed dynamically. This requires NAT64 to 
support NAPT functionality, whereby multiple distinct IPv6 addresses may map 
to the same IPv4 address. NAT64 currently defines methods for translation of TCP, 
UDP, and ICMP messages initiated by IPv6 nodes. (In the case of ICMP queries 
and responses, the ICMP Identifier field is used instead of the transport-layer port 
number; see Chapter 8.) 
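
The NAPT behavior described above can be modeled with a small binding table. The 
Python sketch below is a toy under several assumptions of our own: the class name 
Nat64Bindings and its methods are invented for illustration, a single table is 
shared by all transport protocols, and timers, filtering, and port-allocation 
policy are omitted. It shows only that many IPv6 endpoints can share one IPv4 
address and that a mapping does not depend on the destination. 

```python
import ipaddress


class Nat64Bindings:
    """Toy endpoint-independent mapping table for a NAT64 (illustrative only)."""

    def __init__(self, pool, first_port=1024, last_port=65535):
        # Lazily enumerate every (IPv4 address, port) pair in the pool.
        self._free = ((ipaddress.IPv4Address(a), p)
                      for a in pool
                      for p in range(first_port, last_port + 1))
        self._outbound = {}  # (IPv6 address, port) -> (IPv4 address, port)
        self._inbound = {}   # (IPv4 address, port) -> (IPv6 address, port)

    def outbound(self, src6, sport):
        """Map an IPv6 source endpoint to an IPv4 endpoint, creating state."""
        key = (ipaddress.IPv6Address(src6), sport)
        if key not in self._outbound:
            external = next(self._free)  # StopIteration if the pool is exhausted
            self._outbound[key] = external
            self._inbound[external] = key
        return self._outbound[key]

    def inbound(self, dst4, dport):
        """Look up the IPv6 endpoint for returning IPv4 traffic, if any."""
        return self._inbound.get((ipaddress.IPv4Address(dst4), dport))


nat = Nat64Bindings(["203.0.113.5"])
e1 = nat.outbound("2001:db8::1", 40000)
e2 = nat.outbound("2001:db8::2", 40000)
```

Here e1 and e2 use the same IPv4 address but different ports, and an inbound 
packet for which no binding exists has nowhere to go, which is why translation is 
available only for sessions initiated from the IPv6 side. 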

NAT64 handles fragments differently from its stateless counterpart. For arriving 
TCP or UDP fragments where the transport checksum is nonzero (see Chapter 
10), the NAT64 may either queue the fragments and translate them together or 
translate them individually. A NAT64 must handle fragments, even those arriving 
out of order. A NAT64 may be configured with a time limit (at least 2s) bounding 
the time during which fragments will be cached. Otherwise, the NAT could be 
subject to a DoS attack resulting from the exhaustion of packet buffers holding 
fragments. 


7.7 Attacks Involving Firewalls and NATs 

Given that the primary purpose of deploying firewalls is to reduce the exposure 
to attacks, it is not surprising that firewalls have fewer obvious shortcomings than 
end hosts or routers. That said, they are not without their faults. The most common 
types of firewall problems result from incomplete or incorrect configuration. 
Configuring firewalls is not a trivial task, especially for large enterprises where 
many services may be employed on a daily basis. Other forms of attacks exploit 
the weaknesses of some firewalls, including the inability of many of them (especially 
older ones) to deal with IP fragments. 

One type of problem arises when a NAT/firewall can be hijacked from outside 
to provide a masquerading capability for an attacker. If the firewall is configured 
with NAT enabled, traffic arriving at its external interface may be rewritten so as 
to appear to have come from the NAT device, thereby hiding an attacker's actual 
address. What is worse, this is "normal" behavior from the NAT's point of view; 
it just happens to be getting its input packets from outside rather than inside. 
This has been a particular problem with ipchains-based NAT/firewall rules on 
Linux. The simplest configuration for setting up masquerading: 


Linux# ipchains -P FORWARD MASQUERADE 


allows this attack to take place and is therefore not recommended. As we can see, 
it sets the default forwarding policy to masquerade, which potentially applies to 
any IP forwarding. 

Another type of problem that can arise with firewall and NAT rules is that 
they may be stale. In particular, they may contain port forwarding entries or other 
so-called holes that allow traffic through for services that are no longer used. A 
related problem is that some routers keep more than one copy of the firewall rules 
in memory, and the router must be specifically instructed when to enable which 
rules. Finally, another common configuration problem is that many routers merge 
new firewall rules with the existing set when new ones are added. This can potentially 
lead to undesired results if the operator is unaware of this behavior. 

The problem with fragments is related to how IP fragments are constructed. 
When an IP datagram is fragmented (see Chapter 10), the transport header, which 
contains the port numbers, appears only in the first fragment and in none of the 
others. This is a direct result of the layering and encapsulation of the TCP/IP protocol 
architecture. Unfortunately for a firewall, receiving a fragment other than 
the first provides little information about the transport layer or service to which 
the packet relates. The only obvious way to make this association is to find the 
first fragment (if there ever was one), and this obviously requires a stateful firewall 
capability, which might be subject to resource exhaustion attacks. Even 
stateful firewalls could fall short: if the first fragment arrives after subsequent 
fragments, the firewall may not be smart enough to perform reassembly prior to 
its filtering operation. In some cases, the firewall simply drops fragments it cannot 
fully identify, which could pose problems for legitimate traffic that happens to use 
large datagrams. 


7.8 Summary 

Firewalls provide a mechanism for network administrators to restrict the flow 
of information that may be harmful to end systems. The two major types of firewalls 
are packet-filtering firewalls and proxy firewalls. Packet-filtering firewalls 
may be further separated into the stateful and stateless varieties, and they usually 
act as IP routers. The stateful variety is more sophisticated and supports successful 
operation of a wider variety of application-layer protocols (and might do more 
sophisticated logging or filtering across multiple packets in a packet stream). Proxy 
firewalls usually act as a form of application-layer gateway. For these firewalls, 
each application-layer service must have its own proxy handler on the firewall, 
but this does allow handlers to make modifications even to the data portion of the 
transiting traffic. Protocols such as SOCKS support proxy firewalls in a standardized 
way. 

Network Address Translation (NAT) is a mechanism whereby a relatively 
large number of end hosts can share one or more globally routable IP address(es). 
NAT is used extensively for this purpose but can also be used in conjunction with 
firewall rules to form a NAT/firewall combination. In this popular configuration, 
computers "behind" the NAT are allowed to send traffic out to the global Internet, 
but only traffic returning in response to the outgoing traffic is ordinarily admitted 
back. This presents a small problem for implementing services behind a NAT 
that is handled by port forwarding, which allows the NAT to pass on incoming 
traffic for a service to end hosts inside the NAT. NAT is also being proposed for 
helping the transition from IPv4 to IPv6 by translating addresses between the two 
realms. In addition, NAT is being considered for use within ISPs to further allay 
IPv4 address depletion concerns. If this happens on a large scale, it may become 
(even more) difficult for ordinary users to offer Internet services from their home 
networks. 

Some applications use a set of heuristics in order to determine what addresses 
are used on the outside of the NATs they are behind. Many of these operate unilaterally, 
without direct help from the NAT. Such applications are said to use 
UNSAF (pronounced "unsafe") methods and may not be completely reliable. A 
set of documents (developed by the IETF BEHAVE working group) specifies the 
proper behavior of NATs for different protocols, but not all NATs implement these 
specifications. Consequently, NAT traversal techniques may need to be employed 
to ensure that connectivity can take place. 

NAT traversal involves determining a set of addresses and port numbers that 
can be used to support communications even when one or more NATs must be 
used. STUN is the primary workhorse protocol for determining addresses. TURN 
is a particular STUN usage that relays traffic through a specially configured 
TURN server, usually located in the Internet. Deciding which addresses or relays 
to use can be accomplished using a complete NAT traversal protocol such as ICE. 
ICE determines all possible addresses that can be used between a pair of communicating 
endpoints using local information, plus addresses determined using 
STUN and TURN. It then selects the "best" addresses for subsequent communication. 
Mechanisms such as ICE have received the most attention for supporting 
VoIP services that use the SIP protocol for signaling. 

Firewalls and NATs may require configuration. The basic settings are adequate 
for many home users, but firewalls may require modifications to allow 
certain services to work. In addition, if a user behind a NAT wishes to offer an 
Internet service, port forwarding will likely have to be configured on the NAT 
device. Some applications support configuration by performing direct communication 
with a NAT using protocols such as UPnP and NAT-PMP. When supported 
and enabled, these allow a NAT to have its port forwarding and binding data 
accessed and modified by the application automatically, without user intervention. 
For a home user to run a Web server behind a dynamically provisioned NAT 
(i.e., one with an Internet-facing IP address that changes), additional services such 
as dynamic DNS (see Chapter 11) may also be important. 


8.1 Introduction 

The IP protocol alone provides no direct way for an end system to learn the fate 
of IP packets that fail to make it to their destinations. In addition, IP provides no 
direct way of obtaining diagnostic information (e.g., which routers are used along 
a path or a method to estimate the round-trip time). To address these deficiencies, 
a special protocol called the Internet Control Message Protocol (ICMP) [RFC0792] 
[RFC4443] is used in conjunction with IP to provide diagnostics and control information 
related to the configuration of the IP protocol layer and the disposition of 
IP packets. ICMP is often considered part of the IP layer itself, and it is required 
to be present with any IP implementation. It uses the IP protocol for transport. 
So, precisely, it is neither a network nor a transport protocol but lies somewhere 
between the two. 

ICMP provides for the delivery of error and control messages that may require 
attention. ICMP messages are usually acted on by the IP layer itself, by higher-layer 
transport protocols (e.g., TCP or UDP), and in some cases by user applications. 
Note that ICMP does not provide reliability for IP. Rather, it indicates certain 
classes of failures and configuration information. The most common cause of 
packet drops (buffer overrun at a router) does not elicit any ICMP information. 
Other protocols, such as TCP, handle such situations. 

Because of the ability of ICMP to affect the operation of important system 
functions and obtain configuration information, hackers have used ICMP messages 
in a large number of attacks. As a result of concerns about such attacks, 
network administrators often arrange to block ICMP messages with firewalls, 
especially at border routers. If ICMP is blocked, however, a number of common 
diagnostic utilities (e.g., ping, traceroute) do not work properly [RFC4890]. 

When discussing ICMP, we shall use the term ICMP to refer to ICMP in general, 
and the terms ICMPv4 and ICMPv6 to refer specifically to the versions of 
ICMP used with IPv4 and IPv6, respectively. As we shall see, ICMPv6 plays a far 
more important role in the operation of IPv6 than ICMPv4 does for IPv4. 

[RFC0792] contains the official base specification of ICMPv4, which is refined 
and clarified in [RFC1122] and [RFC1812]. [RFC4443] provides the base specification 
for ICMPv6. [RFC4884] provides a method to add extension objects to certain 
ICMP messages. This facility is used for holding Multiprotocol Label Switching 
(MPLS) information [RFC4950] and for indicating which interface and next hop 
a router would use in forwarding a particular datagram [RFC5837]. [RFC5508] 
gives standard behavioral characteristics of ICMP through NATs (also discussed 
in Chapter 7). In IPv6, ICMPv6 is used for several purposes beyond simple error 
reporting and signaling. It is used for Neighbor Discovery (ND) [RFC4861], which 
plays the same role as ARP does for IPv4 (see Chapter 4). It also includes the 
Router Discovery function used for configuring hosts (see Chapter 6) and multicast 
address management (see Chapter 9). Finally, it is also used to help manage handoffs 
in Mobile IPv6. 

8.1.1 Encapsulation in IPv4 and IPv6 

ICMP messages are encapsulated for transmission within IP datagrams, as shown 
in Figure 8-1. 


[Figure 8-1 layout. IPv4 datagram (IPv4 Protocol field = 1): IPv4 Header, ICMP Header 
(4 bytes), ICMP Data. IPv6 datagram (IPv6 Next Header field = 58): IPv6 Header 
(40 bytes), Extension Headers (If Present), ICMP Header (4 bytes), ICMP Data.] 


Figure 8-1 Encapsulation of ICMP messages in IPv4 and IPv6. The ICMP header contains a checksum 
covering the ICMP data area. In ICMPv6, the checksum also covers the Source and 
Destination IPv6 Address, Length, and Next Header fields in the IPv6 header. 

In IPv4, a Protocol field value of 1 indicates that the datagram carries ICMPv4. 
In IPv6, the ICMPv6 message may begin after zero or more extension headers. The 
last extension header before the ICMPv6 header includes a Next Header field with 
value 58. ICMP messages may be fragmented like other IP datagrams (see Chapter 
10), although this is not common. 

Figure 8-2 shows the format of both ICMPv4 and ICMPv6 messages. The first 
4 bytes have the same format for all messages, but the remainder differ from one 
message to the next. 


[Figure 8-2 layout. Bits 0-15: Type (8 bits) and Code (8 bits). Bits 16-31: Checksum 
(16 bits). Remainder: Contents Depend on Type and Code (variable).] 


Figure 8-2 All ICMP messages begin with 8-bit Type and Code fields, followed by a 16-bit Checksum 
that covers the entire message. The type and code values are different for ICMPv4 and 
ICMPv6. 

In ICMPv4, 42 different values are reserved for the Type field [ICMPTYPES], 
which identify the particular message. Only about 8 of these are in regular 
use, however. We will show the exact format of each commonly used message 
throughout the chapter. Many types of ICMP messages also use different values 
of the Code field to further specify the meaning of the message. The Checksum 
field covers the entire ICMPv4 message; in ICMPv6 it also covers a pseudo-header 
derived from portions of the IPv6 header (see Section 8.1 of [RFC2460]). The algorithm 
used for computing the checksum is the same as that used for the IP header 
checksum defined in Chapter 5. Note that this is our first example of an end-to-end 
checksum. It is carried all the way from the sender of the ICMP message to the 
final recipient. In contrast, the IPv4 header checksum discussed in Chapter 5 is 
changed at every router hop. If an ICMP implementation receives an ICMP message 
with a bad checksum, the message is discarded; there is no ICMP message 
to indicate a bad checksum in a received ICMP message. Recall that the IP layer 
has no protection on the payload portion of the datagram. If ICMP did not include 
a checksum, the contents of the ICMP message might not be correct, leading to 
incorrect system behavior. 
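
The checksum rule for ICMPv4 can be illustrated with a short Python sketch that 
builds an Echo Request (type 8, code 0), verifies it, and shows that a corrupted 
copy is silently discarded. The function names and the sample Identifier, 
Sequence Number, and payload are our own choices; the ICMPv6 pseudo-header is not 
modeled. 

```python
import struct


def internet_checksum(data: bytes) -> int:
    """Internet checksum: complement of the one's complement 16-bit sum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length messages with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident: int, seq: int, payload: bytes = b"") -> bytes:
    """Build an ICMPv4 Echo Request with a checksum over the whole message."""
    unchecked = struct.pack("!BBHHH", 8, 0, 0, ident, seq) + payload
    checksum = internet_checksum(unchecked)
    return struct.pack("!BBHHH", 8, 0, checksum, ident, seq) + payload


def parse_icmp(message: bytes):
    """Return (type, code, checksum, rest), or None if the checksum is bad."""
    if len(message) < 4:
        raise ValueError("ICMP messages have at least a 4-byte header")
    if internet_checksum(message) != 0:
        return None  # discarded; no ICMP error is sent about a bad checksum
    icmp_type, code, checksum = struct.unpack("!BBH", message[:4])
    return icmp_type, code, checksum, message[4:]


good = build_echo_request(0x1234, 1, b"hello")
bad = good[:-1] + bytes([good[-1] ^ 0xFF])  # flip the bits of the last byte
print(parse_icmp(good) is not None, parse_icmp(bad))
```

Note that the receiver never computes the checksum field separately: summing the 
entire message, checksum included, gives 0 when the message is intact. 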


8.2 ICMP Messages 

We now look at ICMP messages in general and the most commonly used ones 
in more detail. ICMP messages are grouped into two major categories: those 
messages relating to problems with delivering IP datagrams (called error messages), 
and those related to information gathering and configuration (called query 
or informational messages). 

8.2.1 ICMPv4 Messages 

For ICMPv4, the informational messages include Echo Request and Echo Reply 
(types 8 and 0, respectively), and Router Advertisement and Router Solicitation 
(types 9 and 10, respectively, together called Router Discovery). The most common 
error message types are Destination Unreachable (type 3), Redirect (type 5), Time 
Exceeded (type 11), and Parameter Problem (type 12). Table 8-1 lists the message 
types defined for standard ICMPv4 messages. 


Table 8-1 The standard ICMPv4 message types, as determined by the Type field* 

Type       Official Name            Reference  E/I  Use/Comment
---------  -----------------------  ---------  ---  -----------------------------------------
0 (*)      Echo Reply               [RFC0792]  I    Echo (ping) reply; returns data
3 (*)(+)   Destination Unreachable  [RFC0792]  E    Unreachable host/protocol
4          Source Quench            [RFC0792]  E    Indicates congestion (deprecated)
5 (*)      Redirect                 [RFC0792]  E    Indicates alternate router should be used
8 (*)      Echo                     [RFC0792]  I    Echo (ping) request (data optional)
9          Router Advertisement     [RFC1256]  I    Indicates router addresses/preferences
10         Router Solicitation      [RFC1256]  I    Requests Router Advertisement
11 (*)(+)  Time Exceeded            [RFC0792]  E    Resource exhausted (e.g., IPv4 TTL)
12 (*)(+)  Parameter Problem        [RFC0792]  E    Malformed packet or header

*Types marked with asterisks (*) are the most common. Those marked with a plus (+) may contain [RFC4884] 
extension objects. In the fourth column, E is for error messages and I indicates query/informational messages. 


For the commonly used messages (those with the asterisks next to the type 
number in Table 8-1), the code numbers shown in Table 8-2 are used. Some messages 
are capable of carrying extended information [RFC4884] (those marked in 
Table 8-1 with the plus sign). 

The official list of message types is maintained by IANA [ICMPTYPES]. 
Many of these message types were defined by the original ICMPv4 specification 
[RFC0792] in 1981, prior to any significant experience using them. Additional 
experience and the development of other protocols (e.g., DHCP) have caused 
many of the messages defined then to fall out of use. When IPv6 (and ICMPv6) 
was designed, this fact was understood, so a somewhat more rational arrangement 
of types and codes has been defined for ICMPv6. 


Table 8-2 Common ICMPv4 message types that use code numbers in addition to 0. Although all of these 
message types are relatively common, only a few of the codes are commonly used. 

Type     Code  Official Name                         Use/Comment
-------  ----  ------------------------------------  -------------------------------------------
3        0     Net Unreachable                       No route (at all) to destination
3 (*)    1     Host Unreachable                      Known but unreachable host
3        2     Protocol Unreachable                  Unknown (transport) protocol
3 (*)    3     Port Unreachable                      Unknown/unused (transport) port
3 (*)    4     Fragmentation Needed and Don't        Needed fragmentation prohibited by DF
               Fragment Was Set (PTB message)        bit; used by PMTUD [RFC1191]
3        5     Source Route Failed                   Intermediary hop not reachable
3        6     Destination Network Unknown           Deprecated [RFC1812]
3        7     Destination Host Unknown              Destination does not exist
3        8     Source Host Isolated                  Deprecated [RFC1812]
3        9     Communication with Destination        Deprecated [RFC1812]
               Network Administratively Prohibited
3        10    Communication with Destination        Deprecated [RFC1812]
               Host Administratively Prohibited
3        11    Destination Network Unreachable       Type of service not available (net)
               for Type of Service
3        12    Destination Host Unreachable for      Type of service not available (host)
               Type of Service
3        13    Communication Administratively        Communication prohibited by filtering
               Prohibited                            policy
3        14    Host Precedence Violation             Precedence disallowed for src/dest/port
3        15    Precedence Cutoff in Effect           Below minimum ToS [RFC1812]
5        0     Redirect Datagram for the Network     Indicates alternate router
               (or Subnet)
5 (*)    1     Redirect Datagram for the Host        Indicates alternate router (host)
5        2     Redirect Datagram for the Type of     Indicates alternate router (ToS/net)
               Service and Network
5        3     Redirect Datagram for the Type of     Indicates alternate router (ToS/host)
               Service and Host
9        0     Normal Router Advertisement           Router's address and configuration
                                                     information
9        16    Does Not Route Common Traffic         With Mobile IP [RFC5944], router does not
                                                     route ordinary packets
11 (*)   0     Time to Live Exceeded in Transit      Hop limit/TTL exceeded
11       1     Fragment Reassembly Time Exceeded     Not all fragments of datagram arrived
                                                     before reassembly timer expired
12 (*)   0     Pointer Indicates the Error           Byte offset (pointer) indicates first
                                                     problem field
12       1     Missing a Required Option             Deprecated/historic
12       2     Bad Length                            Packet had invalid Total Length field


8.2.2 ICMPv6 Messages 

Table 8-3 shows the message types defined for ICMPv6. Note that ICMPv6 is 
responsible not only for error and informational messages but also for a great deal 
of IPv6 router and host configuration. 
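
In ICMPv6, the Type value alone tells a receiver whether a message is an error 
(types 0 to 127) or informational (types 128 to 255). The Python sketch below 
illustrates this split; the function names and the small name table are ours, 
and the table lists only a few of the types defined for ICMPv6. 

```python
# A few well-known ICMPv6 types (a partial list, for illustration only).
ICMPV6_NAMES = {
    1: "Destination Unreachable",
    2: "Packet Too Big",
    3: "Time Exceeded",
    4: "Parameter Problem",
    128: "Echo Request",
    129: "Echo Reply",
    133: "Router Solicitation",
    134: "Router Advertisement",
    135: "Neighbor Solicitation",
    136: "Neighbor Advertisement",
    137: "Redirect Message",
}


def icmpv6_category(msg_type: int) -> str:
    """Classify an ICMPv6 Type value; the high-order bit is 0 for errors."""
    if not 0 <= msg_type <= 255:
        raise ValueError("the Type field is 8 bits wide")
    return "error" if msg_type < 128 else "informational"


def describe(msg_type: int) -> str:
    """Format a Type value with its name (if listed above) and category."""
    name = ICMPV6_NAMES.get(msg_type, "unlisted type")
    return f"{msg_type}: {name} ({icmpv6_category(msg_type)})"


for t in (1, 2, 128, 135):
    print(describe(t))
```

An implementation that receives a Type it does not recognize can still apply this 
test to decide how the message should be treated. 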


Table 8-3 In ICMPv6, error messages have message types from 0 to 127. Informational messages have message 
types from 128 to 255. The plus (+) notation indicates that the message may contain an extension 
structure. Reserved, unassigned, experimental, and deprecated values are not shown. 

Type     Official Name                         Reference  Description
-------  ------------------------------------  ---------  --------------------------------------------
1 (+)    Destination Unreachable               [RFC4443]  Unreachable host, port, protocol
2        Packet Too Big (PTB)                  [RFC4443]  Fragmentation required
3 (+)    Time Exceeded                         [RFC4443]  Hop limit exhausted or reassembly timed out
4        Parameter Problem                     [RFC4443]  Malformed packet or header
100,101  Reserved for private experimentation  [RFC4443]  Reserved for experiments
127      Reserved for expansion of ICMPv6      [RFC4443]  Hold for more error messages
         error messages
128      Echo Request                          [RFC4443]  ping request; may contain data
129      Echo Reply                            [RFC4443]  ping response; returns data
130      Multicast Listener Query              [RFC2710]  Queries multicast subscribers (v1)
131      Multicast Listener Report             [RFC2710]  Multicast subscriber report (v1)
132      Multicast Listener Done               [RFC2710]  Multicast unsubscribe message (v1)
133      Router Solicitation (RS)              [RFC4861]  IPv6 RS with Mobile IPv6 options
134      Router Advertisement (RA)             [RFC4861]  IPv6 RA with Mobile IPv6 options
135      Neighbor Solicitation (NS)            [RFC4861]  IPv6 Neighbor Discovery (Solicit)
136      Neighbor Advertisement (NA)           [RFC4861]  IPv6 Neighbor Discovery (Advertisement)
137      Redirect Message                      [RFC4861]  Use alternative next-hop router
141      Inverse Neighbor Discovery            [RFC3122]  Inverse Neighbor Discovery request: requests
         Solicitation Message                             IPv6 addresses given link-layer address
142      Inverse Neighbor Discovery            [RFC3122]  Inverse Neighbor Discovery response: reports
         Advertisement Message                            IPv6 addresses given link-layer address
143      Version 2 Multicast Listener Report   [RFC3810]  Multicast subscriber report (v2)
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Table 8-3 In ICMPv6, error messages have message types from 0 to 127. Informational messages have message 
types from 128 to 255. The plus (+) notation indicates that the message may contain an extension 
structure. Reserved, unassigned, experimental, and deprecated values are not shown, {continued) 


Type 

Official Name 

Reference 

Description 

144 

Home Agent Address Discovery 
Request Message 

[RFC6275] 

Requests Mobile IPv6 HA 
address; send by mobile node 

145 

Home Agent Address Discovery Reply 
Message 

[RFC6275] 

Contains MIPv6 HA address; 
sent by eligible HA on home 
network 

146 

Mobile Prefix Solicitation 

[RFC6275] 

Request home prefix while away 

147 

Mobile Prefix Advertisement 

[RFC6275] 

Provides prefix from HA to 
mobile 

148 

Certification Path Solicitation Message 

[RFC3971] 

Secure Neighbor Discovery 
(SEND) request for a 
certification path 

149 

Certification Path Advertisement 
Message 

[RFC3971] 

SEND response to certification 
path request 

151 

Multicast Router Advertisement 

[RFC4286] 

Provides address of multicast 

router 

152 

Multicast Router Solicitation 

[RFC4286] 

Requests address of multicast 
router 

153 

Multicast Router Termination 

[RFC4286] 

Done using multicast router 

154 

FMIPv6 Messages 

[RFC5568] 

MIPv6 fast handover messages 

200,201 

Reserved for private experimentation 

[RFC4443] 

Reserved for experiments 

255 

Reserved for expansion of ICMPv6 
informational messages 

[RFC4443] 

Hold for more informational 
messages 


Immediately apparent in this list is the separation between the first set of message
types and the second set (i.e., those messages with types below 128 and those
at or above). In ICMPv6, as in ICMPv4, messages are grouped into the informational
and error classes. In ICMPv6, however, all the error messages have a 0 in the
high-order bit of the Type field. Thus, ICMPv6 types 0 through 127 are all errors,
and types 128 through 255 are all informational. Many of the informational messages
are request/reply pairs.
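
The high-order-bit rule can be checked with a single mask. The following is a minimal sketch; the function name `icmpv6_class` is an illustrative choice, not part of any standard API.

```python
def icmpv6_class(msg_type: int) -> str:
    """Classify an ICMPv6 Type value as 'error' or 'informational'."""
    if not 0 <= msg_type <= 255:
        raise ValueError("ICMPv6 Type is an 8-bit field")
    # Error messages have a 0 in the high-order bit (types 0-127);
    # informational messages have a 1 there (types 128-255).
    return "error" if (msg_type & 0x80) == 0 else "informational"


if __name__ == "__main__":
    for t in (1, 2, 127, 128, 135, 255):
        print(t, icmpv6_class(t))
```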

In comparing the common ICMPv4 messages with the ICMPv6 standard messages,
we conclude that some of the effort in designing ICMPv6 was to eliminate
the unused messages from the original specification while retaining the useful
ones. Following this approach, ICMPv6 also makes use of the Code field, primarily
to refine the meanings of certain error messages. In Table 8-4 we list those standard
ICMPv6 message types (i.e., Destination Unreachable, Time Exceeded, and
Parameter Problem) for which more than the code value 0 has been defined.

Table 8-4 ICMPv6 standard message types with codes in addition to 0 assigned

Type  Code  Name                            Use/Comment
1     0     No Route to Destination         Route not present
1     1     Administratively Prohibited     Policy (e.g., firewall) prohibited
1     2     Beyond Scope of Source Address  Destination scope exceeds source's
1     3     Address Unreachable             Used if codes 0-2 are not appropriate
1     4     Port Unreachable                No transport entity listening on port
1     5     Source Address Failed Policy    Ingress/egress policy violation
1     6     Reject Route to Destination     Specific reject route to destination
3     0     Hop Limit Exceeded in Transit   Hop Limit field decremented to 0
3     1     Reassembly Time Exceeded        Unable to reassemble in limited time
4     0     Erroneous Header Field Found    General header processing error
4     1     Unrecognized Next Header        Unknown Next Header field value
4     2     Unrecognized IPv6 Option        Unknown Hop-by-Hop or Destination option


In addition to the Type and Code fields that define basic functions in ICMPv6, a
large number of standard options are also supported, some of which are required.
This distinguishes ICMPv6 from ICMPv4 (ICMPv4 does not have options). Currently,
standard ICMPv6 options are defined for use only with the ICMPv6 ND
messages (types 135 and 136) using the Option Format field discussed in [RFC4861].
We discuss these options when exploring ND in more detail in Section 8.5.

8.2.3 Processing of ICMP Messages 

In ICMP, the processing of incoming messages varies from system to system. Generally
speaking, the incoming informational requests are handled automatically
by the operating system, and the error messages are delivered to user processes
or to a transport protocol such as TCP [RFC5461]. The processes may choose to
act on them or ignore them. Exceptions to this general rule include the Redirect
message and the Destination Unreachable - Fragmentation Required message.
The former results in an automatic update to the host's routing table, whereas the
latter is used in the path MTU discovery (PMTUD) mechanism, which is generally
implemented by the transport-layer protocols such as TCP. In ICMPv6 the handling
of messages has been tightened somewhat. The following rules are applied
when processing incoming ICMPv6 messages [RFC4443]:

1. Unknown ICMPv6 error messages must be passed to the upper-layer process
that produced the datagram causing the error (if possible).

2. Unknown ICMPv6 informational messages are dropped.

3. ICMPv6 error messages include as much of the original ("offending") IPv6
datagram that caused the error as will fit without making the error message
datagram exceed the minimum IPv6 MTU (1280 bytes).

4. When processing ICMPv6 error messages, the upper-layer protocol type is 
extracted from the original or "offending" packet (contained in the body of 
the ICMPv6 error message) and used to select the appropriate upper-layer 
process. If this is not possible, the error message is silently dropped after 
any IPv6-layer processing. 

5. There are special rules for handling errors (see Section 8.3). 

6. An IPv6 node must limit the rate of ICMPv6 error messages it sends. There 
are a variety of ways of implementing the rate-limiting function, including 
the token bucket approach mentioned in Section 8.3. 


8.3 ICMP Error Messages 

The distinction between the error and informational classes of ICMP messages mentioned
in the previous section is important because certain restrictions are placed
on the generation of ICMPv4 error messages by [RFC1812] and on the generation
of ICMPv6 error messages by [RFC4443] that do not apply to queries. In particular,
an ICMP error message is not to be sent in response to any of the following messages:
another ICMP error message, datagrams with bad headers (e.g., bad checksum),
IP-layer broadcast/multicast datagrams, datagrams encapsulated in link-layer
broadcast or multicast frames, datagrams with an invalid or network zero source
address, or any fragment other than the first. The reason for imposing these restrictions
on the generation of ICMP errors is to limit the creation of so-called broadcast
storms, a scenario in which the generation of a small number of messages creates an
unwanted traffic cascade (e.g., by generating error responses in response to error
responses, indefinitely). These rules can be summarized as follows:

An ICMPv4 error message is never generated in response to 

• An ICMPv4 error message. (An ICMPv4 error message may, however, be 
generated in response to an ICMPv4 query message.) 

• A datagram destined for an IPv4 broadcast address or an IPv4 multicast 
address (formerly known as a class D address). 

• A datagram sent as a link-layer broadcast. 

• A fragment other than the first. 

• A datagram whose source address does not define a single host. This means 
that the source address cannot be a zero address, a loopback address, a 
broadcast address, or a multicast address. 

ICMPv6 is similar. An ICMPv6 error message is never generated in response to

• An ICMPv6 error message 

• An ICMPv6 Redirect message 

• A packet destined for an IPv6 multicast address, with two exceptions: 

- The Packet Too Big (PTB) message 

- The Parameter Problem message (code 2) 

• A packet sent as a link-layer multicast (with the exceptions noted previously) 

• A packet sent as a link-layer broadcast (with the exceptions noted previously) 

• A packet whose source address does not uniquely identify a single node.
This means that the source address cannot be an unspecified address, an
IPv6 multicast address, or any address known by the sender to be an anycast
address.
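
The ICMPv6 rules above can be sketched as a single predicate. This is an illustration only: the `Offending` record and the function `may_send_icmpv6_error` are hypothetical names, and the two boolean flags stand in for knowledge (link-layer delivery mode, anycast configuration) that a real stack would obtain from its own state.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

ICMPV6_PTB = 2              # Packet Too Big
ICMPV6_PARAM_PROBLEM = 4    # Parameter Problem
ICMPV6_REDIRECT = 137       # Redirect


@dataclass
class Offending:
    """Hypothetical summary of the packet that triggered the error."""
    src: str
    dst: str
    link_layer_multicast_or_broadcast: bool = False
    icmpv6_type: Optional[int] = None   # set only if the packet is ICMPv6
    src_known_anycast: bool = False


def may_send_icmpv6_error(pkt: Offending, err_type: int, err_code: int = 0) -> bool:
    # Never answer an ICMPv6 error (types 0-127) or a Redirect with an error.
    if pkt.icmpv6_type is not None:
        if pkt.icmpv6_type < 128 or pkt.icmpv6_type == ICMPV6_REDIRECT:
            return False
    # PTB and Parameter Problem code 2 are the two multicast exceptions.
    exempt = err_type == ICMPV6_PTB or (
        err_type == ICMPV6_PARAM_PROBLEM and err_code == 2)
    dst = ipaddress.IPv6Address(pkt.dst)
    if (dst.is_multicast or pkt.link_layer_multicast_or_broadcast) and not exempt:
        return False
    # The source must identify a single node.
    src = ipaddress.IPv6Address(pkt.src)
    if src.is_unspecified or src.is_multicast or pkt.src_known_anycast:
        return False
    return True
```

For example, a Destination Unreachable (type 1) is suppressed for a packet sent to `ff02::1`, whereas a PTB (type 2) for the same packet is allowed.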

In addition to the rules governing the conditions under which ICMP messages 
are generated, there is also a rule that limits the overall ICMP traffic level from a 
single sender. In [RFC4443], a recommendation for rate-limiting ICMP messages 
is to use a token bucket. With a token bucket, a "bucket" holds a maximum number 
(B) of "tokens," each of which allows a certain number of messages to be sent. 
The bucket is periodically filled with new tokens (at rate N) and drained by 1 for 
each message sent. Thus, a token bucket (or token bucket filter, as it is often called) 
is characterized by the parameters (B, N). For small or midsize devices, [RFC4443] 
provides an example token bucket using the parameters (10, 10). Token buckets 
are a common mechanism used in protocol implementations to limit bandwidth 
utilization, and in many cases B and N are in byte units rather than message units. 
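
A token bucket with parameters (B, N) can be sketched in a few lines. The class below is a minimal illustration, assuming N is expressed in tokens per second and one token is spent per message; the class name and the optional `now` argument (used to make the behavior reproducible) are choices made for this example.

```python
import time


class TokenBucket:
    """Token bucket holding at most `capacity` (B) tokens, refilled at `rate` (N) per second."""

    def __init__(self, capacity: float, rate: float, now: float = None):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)          # start with a full bucket
        self.last = time.monotonic() if now is None else now

    def allow(self, now: float = None) -> bool:
        """Return True if one message may be sent at time `now`."""
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, never above the capacity B.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1                   # drain by 1 for each message sent
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=10, rate=10, now=0.0)    # the (10, 10) example
    burst = sum(bucket.allow(now=0.0) for _ in range(15))
    print("allowed in initial burst:", burst)
```

With (10, 10), a burst of up to 10 errors is sent immediately, after which the sustained rate is 10 messages per second.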

When an ICMP error message is sent, it contains a copy of the full IP header 
from the "offending" or "original" datagram (i.e., the IP header of the datagram 
that caused the error to be generated, including any IP options), plus any other 
data from the original datagram's IP payload area such that the generated IP/ 
ICMP datagram's size does not exceed a specific value. For IPv4 this value is 576 
bytes, and for IPv6 it is the IPv6 minimum MTU, which is at least 1280 bytes. 
Including a portion of the payload from the original IP datagram lets the receiving
ICMP module associate the message with one particular protocol (e.g., TCP
or UDP) from the Protocol or Next Header field in the IP header and one particular 
user process (from the TCP or UDP port numbers that are in the TCP or UDP 
header contained in the first 8 bytes of the IP datagram payload area). 
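
The size limits translate into a simple calculation of how many bytes of the offending datagram can be quoted. The sketch below assumes that the error datagram itself carries a 20-byte IPv4 header or a 40-byte IPv6 header with no options or extension headers, followed by an 8-byte ICMP header; the function name is illustrative.

```python
IPV4_MAX_ERROR_DATAGRAM = 576   # limit on the whole IPv4/ICMPv4 error datagram
IPV6_MIN_MTU = 1280             # limit on the whole IPv6/ICMPv6 error datagram
ICMP_HEADER_LEN = 8             # Type, Code, Checksum plus 4 more bytes


def quoted_length(original_len: int, version: int) -> int:
    """Bytes of the offending datagram that fit in the ICMP error payload."""
    if version == 4:
        limit, outer_header = IPV4_MAX_ERROR_DATAGRAM, 20   # assumes no IPv4 options
    elif version == 6:
        limit, outer_header = IPV6_MIN_MTU, 40              # assumes no extension headers
    else:
        raise ValueError("version must be 4 or 6")
    return min(original_len, limit - outer_header - ICMP_HEADER_LEN)


if __name__ == "__main__":
    print(quoted_length(44, 4))     # a small datagram is quoted in full
    print(quoted_length(1500, 4))   # a large IPv4 datagram is truncated
    print(quoted_length(1500, 6))   # a large IPv6 datagram is truncated
```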

Before the publication of [RFC1812], the ICMP specification required only the
first 8 bytes of the offending IP datagram to be included (because this is enough
to determine the port number for UDP and TCP; see Chapters 10 and 12), but as
more complex protocol layerings have become popular (such as IP being encapsulated
in IP), additional information is now needed for the effective diagnosis of
problems. In addition, several error messages may include extensions. We begin
by briefly discussing the extension method, and then we discuss each of the more
important ICMP error messages.

8.3.1 Extended ICMP and Multipart Messages 

[RFC4884] specifies a method for extending the utility of ICMP messages by allowing
an extension data structure to be appended to them. The extension structure
includes an extension header and extension objects that may contain a variable
amount of data, as illustrated in Figure 8-3.


[Figure 8-3 diagram: the ICMP header (Type, Code, Checksum, then a Length field whose position differs
between ICMPv4 and ICMPv6, plus other fields) is followed by the ICMP payload (at least 128 bytes, e.g., the
first portion of the original datagram), the ICMP extension header (Vers, Reserved, Checksum) [RFC4884],
and one or more objects, each with an object header and object data [RFC4884].]

Figure 8-3 Extended ICMPv4 and ICMPv6 messages include a 32-bit extension header and zero or more
associated objects. Each object includes a fixed-size header and a variable-length data area. For
compatibility, the primary ICMP payload area is at least 128 bytes.

The Length field is repurposed from the sixth byte of the ICMPv4 header and
the fifth byte of the ICMPv6 header. (These bytes had previously been reserved
with value 0.) In ICMPv4, it indicates the offending datagram size in 32-bit word
units. For ICMPv6, it is in 64-bit units. These datagram portions are padded with
zeros as necessary to be 32-bit- and 64-bit-aligned, respectively. When extensions
are used, the ICMP payload area containing the original datagram must be at least
128 bytes long.
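
The padding and Length computation can be sketched as follows. The helper name `pad_original_datagram` is hypothetical; it zero-pads the quoted datagram to the 128-byte minimum and to the word boundary described above, then reports the Length value in the appropriate units.

```python
def pad_original_datagram(original: bytes, version: int):
    """Return (padded_bytes, length_field) for an extended ICMP message.

    version 4: Length counts 32-bit words; version 6: Length counts 64-bit words.
    """
    if version not in (4, 6):
        raise ValueError("version must be 4 or 6")
    unit = 4 if version == 4 else 8
    # With extensions present, the original-datagram area is at least 128 bytes.
    target = max(len(original), 128)
    # Round up to the next multiple of the word size.
    target = -(-target // unit) * unit
    padded = original.ljust(target, b"\x00")    # zero padding
    return padded, target // unit


if __name__ == "__main__":
    padded, length = pad_original_datagram(b"\x45" * 44, version=4)
    print(len(padded), length)
```

A 44-byte quoted datagram is padded to 128 bytes in either version; the Length field then holds 32 for ICMPv4 and 16 for ICMPv6.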

The extension structure may be used with ICMPv4 Destination Unreachable,
Time Exceeded, and Parameter Problem messages as well as ICMPv6 Destination
Unreachable and Time Exceeded messages. We will look at each of these in some
detail in the following sections.

8.3.2 Destination Unreachable (ICMPv4 Type 3, ICMPv6 Type 1) and Packet Too Big
(ICMPv6 Type 2)

We now look more closely at one of the more common ICMP message types, Destination
Unreachable. Messages of this type are used to indicate that a datagram
could not be delivered all the way to its destination because of either a problem in
transit or the lack of a receiver interested in receiving it. Although 16 different codes
are defined for this message in ICMPv4, only 4 are commonly used. These include
Host Unreachable (code 1), Port Unreachable (code 3), Fragmentation Required/
Don't-Fragment Specified (code 4), and Communication Administratively Prohibited
(code 13). In ICMPv6, the Destination Unreachable message is type 1 with
seven possible code values. In ICMPv6, as compared with IPv4, the Fragmentation
Required message has been replaced by an entirely different type (type 2), but the
usage is very similar to the corresponding ICMP Destination Unreachable message,
so we discuss it here. In ICMPv6 this is called the Packet Too Big (PTB) message. We
will use the simpler ICMPv6 PTB terminology from here onward to refer to either
the ICMPv4 (type 3, code 4) message or the ICMPv6 (type 2, code 0) message.

The formats for all of the Destination Unreachable messages specified for
ICMPv4 and ICMPv6 are shown in Figure 8-4. For Destination Unreachable messages,
the Type field is 3 for ICMPv4 and 1 for ICMPv6. The Code field indicates the
particular item or reason for the reachability failure. We now look at each of these
messages in detail.

8.3.2.1 ICMPv4 Host Unreachable (Code 1) and ICMPv6 Address Unreachable
(Code 3)

This form of the Destination Unreachable message is generated by a router or
host when it is required to send an IP datagram to a host using direct delivery
(see Chapter 5) but for some reason cannot reach the destination. This situation
may arise, for example, because the last-hop router is attempting to send an ARP
request to a host that is either missing or down. This situation is explored in Chapter 4,
which describes ARP. For ICMPv6, which uses a somewhat different mechanism
for detecting unresponsive hosts, this message can be the result of a failure
in the ND process (see Section 8.5).

[Figure 8-4 diagram: ICMPv4 layout (left): Type (3), Code, Checksum (16 bits); Unused, Length (in 32-bit
words), Various (depends on code); IPv4 header + initial bytes of original datagram; extension structure
(if present). ICMPv6 layout (right): Type (1), Code, Checksum (16 bits); Length (in 64-bit words), Unused;
IPv6 header + initial bytes of original datagram; extension structure (if present).]

Figure 8-4 The ICMP Destination Unreachable messages in ICMPv4 (left) and ICMPv6 (right). The Length
field, present in extended ICMP implementations that conform to [RFC4884], gives the number of
words used to hold the original datagram measured in 4-byte units (IPv4) or 8-byte units (IPv6).
An optional extension structure may be included. The ICMP field labeled various is used to hold
the next-hop MTU when the code value is 4, which is used by PMTUD. ICMPv6 uses a different
ICMPv6 PTB message (ICMPv6 type 2) for this purpose.


8.3.2.2 ICMPv6 No Route to Destination (Code 0)

This message refines the Host Unreachable message from ICMPv4 to differentiate
those hosts not reachable because of failure of direct delivery and those that
cannot be reached because no route is present. This message is generated only in
cases where an arriving datagram must be forwarded without using direct delivery,
but where no route entry exists to indicate what router to use as a next hop. As
we have seen, IP routers must contain a valid next-hop forwarding entry for the
destination in any packets they receive if they are going to successfully perform
forwarding.

8.3.2.3 ICMPv4 Communication Administratively Prohibited (Code 13) and
ICMPv6 Communication with Destination Administratively Prohibited
(Code 1)

In ICMPv4 and ICMPv6, these Destination Unreachable messages provide the ability
to indicate that an administrative prohibition is preventing successful communication
with the destination. This is typically the result of a firewall (see Chapter 7)
that intentionally drops traffic that fails to comply with some operational policy
enforced by the router that sent the ICMP error. In many cases, the fact that there is
a special policy to drop traffic should not be advertised, so it is generally possible
to disable the generation of these messages by either silently discarding incoming
packets or generating some other ICMP error message instead.

8.3.2.4 ICMPv4 Port Unreachable (Code 3) and ICMPv6 Port Unreachable (Code 4)

The Port Unreachable message is generated when an incoming datagram is destined
for an application that is not ready to receive it. This occurs most commonly
in conjunction with UDP (see Chapter 10), when a message is sent to a port number
that is not in use by any server process. If UDP receives a datagram and the destination
port does not correspond to a port that some process has in use, UDP
responds with an ICMP Port Unreachable message.

We can illustrate the operation of ICMPv4 Port Unreachable messages using
the Trivial File Transfer Protocol (TFTP) [RFC1350] client on Windows or Linux while
watching the packet exchange using tcpdump. The well-known UDP port for the
TFTP service is 69. However, while the TFTP client is available on many systems,
most systems do not run TFTP servers. Therefore, it is easy to see what happens
when we try to access a nonexistent server. In the example shown in Listing 8-1,
we execute the TFTP client, called tftp, on a Windows machine and attempt to
fetch a file from a Linux machine. The -s option for tcpdump causes 1500 bytes
to be captured per packet; the -i eth1 option tells tcpdump to monitor traffic on
the Ethernet interface named eth1; the -vv option causes additional descriptive
output to be included; and the expression icmp or port tftp causes traffic
matching either the TFTP port (69) or the ICMPv4 protocol to be included in the
output.

Listing 8-1 TFTP client demonstrating an application timeout and ICMP rate limiting

C:\> tftp 10.0.0.1 get /foo          try to fetch file "/foo" from 10.0.0.1
Timeout occurred                     timeout occurred after about 9 seconds

Linux# tcpdump -s 1500 -i eth1 -vv icmp or port tftp
1  09:45:48.974812 IP (tos 0x0, ttl 128, id 9914, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
2  09:45:48.974812 IP (tos 0xc0, ttl 255, id 43734, offset 0,
        flags [none], length: 72)
        10.0.0.1 > 10.0.0.54: icmp 52:
        10.0.0.1 udp port tftp unreachable
        for IP (tos 0x0, ttl 128, id 9914, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
3  09:45:49.014812 IP (tos 0x0, ttl 128, id 9915, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
4  09:45:49.014812 IP (tos 0xc0, ttl 255, id 43735, offset 0,
        flags [none], length: 72)
        10.0.0.1 > 10.0.0.54: icmp 52:
        10.0.0.1 udp port tftp unreachable
        for IP (tos 0x0, ttl 128, id 9915, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
5  09:45:49.014812 IP (tos 0x0, ttl 128, id 9916, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
6  09:45:49.014812 IP (tos 0xc0, ttl 255, id 43736, offset 0,
        flags [none], length: 72)
        10.0.0.1 > 10.0.0.54: icmp 52:
        10.0.0.1 udp port tftp unreachable
        for IP (tos 0x0, ttl 128, id 9916, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
7  09:45:49.024812 IP (tos 0x0, ttl 128, id 9917, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
8  09:45:49.024812 IP (tos 0xc0, ttl 255, id 43737, offset 0,
        flags [none], length: 72)
        10.0.0.1 > 10.0.0.54: icmp 52:
        10.0.0.1 udp port tftp unreachable
        for IP (tos 0x0, ttl 128, id 9917, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
9  09:45:49.024812 IP (tos 0x0, ttl 128, id 9918, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
10 09:45:49.024812 IP (tos 0xc0, ttl 255, id 43738, offset 0,
        flags [none], length: 72)
        10.0.0.1 > 10.0.0.54: icmp 52:
        10.0.0.1 udp port tftp unreachable
        for IP (tos 0x0, ttl 128, id 9918, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
11 09:45:49.034812 IP (tos 0x0, ttl 128, id 9919, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
12 09:45:49.034812 IP (tos 0xc0, ttl 255, id 43739, offset 0,
        flags [none], length: 72)
        10.0.0.1 > 10.0.0.54: icmp 52:
        10.0.0.1 udp port tftp unreachable
        for IP (tos 0x0, ttl 128, id 9919, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
13 09:45:49.034812 IP (tos 0x0, ttl 128, id 9920, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
14 09:45:57.054812 IP (tos 0x0, ttl 128, id 22856, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
15 09:45:57.054812 IP (tos 0xc0, ttl 255, id 43740, offset 0,
        flags [none], length: 72)
        10.0.0.1 > 10.0.0.54: icmp 52:
        10.0.0.1 udp port tftp unreachable
        for IP (tos 0x0, ttl 128, id 22856, offset 0,
        flags [none], length: 44)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok] 16
        RRQ "/foo" netascii
16 09:45:57.064812 IP (tos 0x0, ttl 128, id 22906, offset 0,
        flags [none], length: 51)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok]
        23 ERROR EUNDEF "timeout on receive"
17 09:45:57.064812 IP (tos 0xc0, ttl 255, id 43741, offset 0,
        flags [none], length: 79)
        10.0.0.1 > 10.0.0.54: icmp 59:
        10.0.0.1 udp port tftp unreachable
        for IP (tos 0x0, ttl 128, id 22906, offset 0,
        flags [none], length: 51)
        10.0.0.54.3871 > 10.0.0.1.tftp: [udp sum ok]
        23 ERROR EUNDEF "timeout on receive"



Here we see a set of seven requests grouped very close to each other in time.
The initial request (identified as RRQ for file /foo) comes from UDP port 3871,
destined for the TFTP service (port 69). An ICMPv4 Port Unreachable message is
immediately returned (packet 2), but the TFTP client appears to ignore the message,
sending another UDP datagram right away. This continues immediately
six more times. After waiting about another 8s, the client tries one last time and
finally gives up.

Note that the ICMPv4 messages are sent without any port number designation,
and each 16-byte TFTP packet is from a specific port (3871) and to a specific
port (TFTP, equal to 69). The number 16 at the end of each TFTP read request
(RRQ) line is the length of the data in the UDP datagram. In this example, 16 is
the sum of the TFTP's 2-byte opcode, the 5-byte null-terminated name /foo, and
the 9-byte null-terminated string netascii. The full ICMPv4 Unreachable message
is depicted in Figure 8-5. It is 52 bytes long (not including the IPv4 header):
4 bytes for the basic ICMPv4 header, followed by 4 unused bytes (see Figure 8-5;
this implementation does not use [RFC4884] extensions), the 20-byte offending
IPv4 header, 8 bytes for the UDP header, and finally the remaining 16 bytes from
the original tftp application request (4 + 4 + 20 + 8 + 16 = 52).
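
The byte counts above can be reproduced by building the read request and adding up the header sizes. This is a small sketch; `tftp_rrq` is a helper written for this example, and the header sizes assume no IPv4 options, as in the trace.

```python
import struct

TFTP_OPCODE_RRQ = 1          # read request opcode [RFC1350]
IPV4_HEADER = 20             # no options
UDP_HEADER = 8
ICMP_HEADER = 4 + 4          # basic ICMPv4 header plus 4 unused bytes


def tftp_rrq(filename: str, mode: str = "netascii") -> bytes:
    """Build a TFTP RRQ: 2-byte opcode, then two null-terminated strings."""
    return (struct.pack("!H", TFTP_OPCODE_RRQ)
            + filename.encode("ascii") + b"\x00"
            + mode.encode("ascii") + b"\x00")


rrq = tftp_rrq("/foo")
offending_datagram = IPV4_HEADER + UDP_HEADER + len(rrq)   # IPv4 "length: 44"
icmp_message = ICMP_HEADER + offending_datagram            # "icmp 52"
error_datagram = IPV4_HEADER + icmp_message                # IPv4 "length: 72"

if __name__ == "__main__":
    print(len(rrq), offending_datagram, icmp_message, error_datagram)
```

The four values correspond to the 16, 44, 52, and 72 that appear in the tcpdump output of Listing 8-1.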


[Figure 8-5 diagram: an IPv4 datagram made up of the IPv4 header (20 bytes, no options, Protocol=1) and
the ICMPv4 message. The ICMPv4 message consists of the ICMP header (8 bytes, including the unused field)
followed by the ICMP payload, a portion of the offending datagram: its IP header (20 bytes, no options,
Protocol=17), UDP header (8 bytes), and TFTP application-layer data (not more than 520 bytes).]

Figure 8-5 An ICMPv4 Destination Unreachable - Port Unreachable error message contains as
much of the offending IPv4 datagram as possible such that the overall IPv4 datagram
does not exceed 576 bytes. In this example, there is enough room to include the entire
TFTP request message.


As mentioned previously, one reason ICMP includes the offending IP header
in error messages is that doing so helps ICMP know how to interpret the bytes that
follow the encapsulated IP header (the UDP header in this example). Because a copy of
the offending UDP header is included in the returned ICMP message, the source
and destination port numbers can be learned. It is this destination port number
(tftp, 69) that caused the ICMP Port Unreachable message to be generated. The
source port number (3871) can be used by the system receiving the ICMP error to
associate the error with a particular user process (the TFTP client in this example,
although we saw that this client does not make much use of the indication).
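
A receiver can recover the ports from the quoted headers along the following lines. The sketch is illustrative: `parse_port_unreachable` is a hypothetical helper, it does not verify checksums, and the sample bytes are hand-built to mirror packet 2 of Listing 8-1 rather than taken from a capture.

```python
import struct


def parse_port_unreachable(icmp: bytes):
    """Return (src_port, dst_port) of the offending UDP datagram, or None."""
    if len(icmp) < 8 + 20 or (icmp[0], icmp[1]) != (3, 3):
        return None                       # not ICMPv4 type 3, code 3
    quoted = icmp[8:]                     # skip 4-byte header + 4 unused bytes
    ihl = (quoted[0] & 0x0F) * 4          # offending IPv4 header length
    if quoted[9] != 17 or len(quoted) < ihl + 4:
        return None                       # Protocol field is not UDP, or truncated
    return struct.unpack("!HH", quoted[ihl:ihl + 4])


# Hand-built sample (checksums left as 0 for brevity).
icmp_header = bytes([3, 3, 0, 0, 0, 0, 0, 0])
ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 44, 9914, 0, 128, 17, 0,
                        bytes([10, 0, 0, 54]), bytes([10, 0, 0, 1]))
udp_header = struct.pack("!HHHH", 3871, 69, 8 + 16, 0)
sample = icmp_header + ip_header + udp_header + b"\x00\x01/foo\x00netascii\x00"

if __name__ == "__main__":
    print(len(sample), parse_port_unreachable(sample))
```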

Note that after the seventh request (packet 13), no error is returned for some
time. The reason for this is that the Linux-based server performs rate limiting. That
is, it limits the number of ICMP messages of the same type that can be generated
in a period of time, as suggested by [RFC1812]. If we look at the elapsed time
between the initial error message (packet 2, with timestamp 48.974812) and the
final message before the 8s gap (packet 12, with timestamp 49.034812), we compute
that 60ms have elapsed. If we count the number of ICMP messages over this time,
we conclude that (6 messages/.06s) = 100 messages/s is the rate limit. This can be
verified by inspecting the values of the ICMPv4 rate mask and rate limit in Linux:


Linux% sysctl -a | grep icmp_rate 

net.ipv4.icmp_ratemask = 6168 
net.ipv4.icmp_ratelimit = 100 

Here we see that several ICMPv4 messages are to be rate-limited, and that the
rate limit for all of them is 100 (measured in messages per second). The ratemask
variable indicates which messages have the limit applied to them, by turning on
the kth bit in the mask if the message with type number k is to be limited, starting
from 0. In this case, types 3, 4, 11, and 12 are being limited (because 6168 = 0x1818
= 0001100000011000, where bits 3, 4, 11, and 12 from the right are turned on). If we
were to set the rate limit to 0 (meaning no limit), we would find that Linux returns
nine ICMPv4 messages, one corresponding to each tftp request packet, and the
tftp client times out almost immediately. This behavior also occurs when trying
to access a Windows XP machine, which does not perform ICMP rate limiting.
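
Decoding the mask is a matter of listing its set bits. A minimal sketch follows; the function name is illustrative, and the name table covers only the four ICMPv4 types selected by the value shown above.

```python
ICMPV4_TYPE_NAMES = {
    3: "Destination Unreachable",
    4: "Source Quench",
    11: "Time Exceeded",
    12: "Parameter Problem",
}


def ratemask_types(mask: int) -> list:
    """Return the bit positions (counted from 0 at the right) set in the mask."""
    return [k for k in range(mask.bit_length()) if mask & (1 << k)]


if __name__ == "__main__":
    mask = 6168
    print(hex(mask), format(mask, "016b"))
    for k in ratemask_types(mask):
        print(k, ICMPV4_TYPE_NAMES.get(k, "other"))
```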

Why does the TFTP client keep retransmitting its request when the error messages
are being returned? A detail of network programming is revealed here.
Most systems do not notify user processes using UDP that ICMP messages for
them have arrived unless the process calls a special function (i.e., connect on the
UDP socket). Common TFTP clients do not call this function, so they never receive
the ICMP error notification. Without hearing any response regarding the fate of
its TFTP protocol requests, the TFTP client tries again and again to retrieve its file.
This is an example of a poor request and retry mechanism. Although TFTP does
have extensions for adjusting this behavior (see [RFC2349]), we shall see later (in
Chapter 16) that a more sophisticated transport protocol such as TCP has a much
better algorithm.
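
The effect of calling connect on a UDP socket can be observed with a short experiment. The sketch below is an illustration, not the TFTP client's code: `probe_udp_port` is a hypothetical helper, and whether the error is actually delivered depends on the operating system and on any filtering or ICMP rate limiting in the path, so a timeout is also a possible outcome.

```python
import socket


def probe_udp_port(host: str, port: int, timeout: float = 1.0) -> str:
    """Send one datagram on a connected UDP socket and report what comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.connect((host, port))      # lets the OS report ICMP errors to this socket
        try:
            s.send(b"probe")
            s.recv(1500)
            return "reply"
        except (ConnectionRefusedError, ConnectionResetError):
            # Typically how an ICMP Port Unreachable surfaces to the application.
            return "refused"
        except socket.timeout:
            return "timeout"


if __name__ == "__main__":
    # Port 69 is the TFTP port; most hosts do not run a TFTP server.
    print(probe_udp_port("127.0.0.1", 69))
```

Without the connect call (for example, using sendto and recvfrom on an unconnected socket), many systems would simply let the receive time out, which matches the retransmission behavior seen in Listing 8-1.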

8.3.2.5 ICMPv4 PTB (Code 4) 

If an IPv4 router receives a datagram that it intends to forward, and if the datagram
does not fit into the MTU in use on the selected outgoing network interface,
the datagram must be fragmented (see Chapter 10). If the arriving datagram has
the Don't Fragment bit field set in its IP header, however, it is not forwarded but
instead is dropped, and this ICMPv4 Destination Unreachable (PTB) message is
generated. Because the router sending this message knows the MTU of the next
hop, it is able to include the MTU value in the error message it generates.
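
The router's choice described above reduces to a three-way decision. The following sketch uses hypothetical names (`Decision`, `forward_ipv4`) and models only the size check, not the rest of the forwarding path.

```python
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    action: str                          # "forward", "fragment", or "drop_and_send_ptb"
    next_hop_mtu: Optional[int] = None   # reported in the ICMPv4 PTB message


def forward_ipv4(total_length: int, dont_fragment: bool, egress_mtu: int) -> Decision:
    """Decide what an IPv4 router does with a datagram bound for a given link."""
    if total_length <= egress_mtu:
        return Decision("forward")
    if dont_fragment:
        # Too large and DF set: drop, and report the MTU of the next hop.
        return Decision("drop_and_send_ptb", egress_mtu)
    return Decision("fragment")


if __name__ == "__main__":
    print(forward_ipv4(1500, dont_fragment=True, egress_mtu=1400))
```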

This message was originally intended to be used for network diagnostics but
has since been used for path MTU discovery. PMTUD is used to determine an
appropriate packet size to use when communicating with a particular host, on the
assumption that avoiding packet fragmentation is desirable. It is used most commonly
with TCP, and we cover it in more detail in Chapter 14.

8.3.2.6 ICMPv6 PTB (Type 2, Code 0)

In ICMPv6, a special message and type code combination is used to indicate that
a packet is too large for the MTU of the next hop (see Figure 8-6).

[Figure 8-6 diagram: Type (2), Code (0), Checksum; MTU (32 bits); IPv6 header + initial bytes of
original datagram.]

Figure 8-6 The ICMPv6 Packet Too Big message (type 2) works like the corresponding ICMPv4
Destination Unreachable message. The ICMPv6 variant includes 32 bits to hold the next-hop
MTU.

This message is not a Destination Unreachable message. Recall that in IPv6,
packet fragmentation is performed only by the sender of a datagram and that
MTU discovery is always supposed to be used. Thus, this message is used primarily
by the IPv6 PMTUD mechanism, but also in the (rare) circumstances that
a packet arrives that is too large to be carried over the next hop. Because routes
may change after the operation of PMTUD and after a packet is injected into the
network, it is always possible that a packet arriving at a router is too large for the
outgoing MTU. As is the case with modern implementations of ICMPv4 Destination
Unreachable code 4 (PTB) messages, the suggested MTU size of the packet,
based on the MTU of the egress link of the router generating the ICMP message,
is carried in the indication.

8.3.2.7 ICMPv6 Beyond Scope of Source Address (Code 2)

As we saw in Chapter 2, IPv6 uses addresses of different scopes. Thus, it is possible
to construct a packet with source and destination addresses of different
scopes. Furthermore, it is possible that the destination address may not be reachable
within the same scope. For example, a packet with a source address using
link-local scope may be destined for a globally scoped destination that requires
traversal of more than one router. Because the source address is of insufficient
scope, the packet is dropped by a router, and this form of ICMPv6 error is produced
to indicate the problem.
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8.3.2.8 ICMPvd Source Address Failed Ingress/Egress Policy (Code 5) 

Code 5 is a more refined version of code 1, to be used when a particular ingress
or egress filtering policy is the reason for prohibiting the successful delivery of a
datagram. This might be used, for example, when a host attempts to send traffic
using a source IPv6 address from an unexpected network prefix [RFC3704].

8.3.2.9 ICMPv6 Reject Route to Destination (Code 6)

A reject or blocking route is a special routing or forwarding table entry (see Chapter
5), which indicates that matching packets should be dropped and an ICMPv6
Destination Unreachable Reject Route message should be generated. (A similar type of
entry called a blackhole route also causes matching packets to be dropped, but
usually without generating the Destination Unreachable message.) Such routes may
be installed in a router's forwarding table to prevent leakage of packets sent to
unwanted destinations. Unwanted destinations may include martian routes
(prefixes not used on the public Internet) and bogons (valid prefixes not yet allocated).

8.3.3 Redirect (ICMPv4 Type 5, ICMPv6 Type 137) 

If a router receives a datagram from a host and can determine that it is not the
correct next hop for the host to have used to deliver the datagram to its destination,
the router sends a Redirect message to the host and sends the datagram on to the
correct router (or host). That is, if it can determine that there is a better next hop
than itself for the given datagram, it redirects the host to update its forwarding
table so that future traffic for the same destination will be directed toward the
new node. This facility provides a crude form of routing protocol by indicating to
the IP forwarding function where to send its packets. The process of IP
forwarding is discussed in detail in Chapter 5.

In Figure 8-7, a network segment has a host and two routers, R1 and R2. When
the host sends a datagram incorrectly through router R2, R2 responds by sending
the Redirect message to the host, while forwarding the datagram to R1. Although
hosts may be configured to update their forwarding tables based on ICMP
redirects, routers are discouraged from doing so under the assumption that
routers should already know the best next-hop nodes for all reachable destinations
because they are using dynamic routing protocols.

The ICMP Redirect message includes the IP address of the router (or
destination host, if it is reachable using direct delivery) a host should use as a next hop for
the destination specified in the ICMP error message (see Figure 8-8). Originally
the redirect facility supported a distinction between a redirect for a host and a
redirect for a network, but once classless addressing was used (CIDR; see Chapter
2), the network redirect form effectively vanished. Thus, when a host receives a
host redirect, it is effective only for that single IP destination address. A host that
consistently chooses the wrong router can wind up with a forwarding table entry
for every destination it contacts outside its local subnet, each of which has been
added as the result of receiving a Redirect message from its configured default
router. The format of the ICMPv4 Redirect message is shown in Figure 8-8.

Figure 8-7 The host incorrectly sends a datagram via R2 toward its destination. R2 realizes the
host's mistake and sends the datagram to the proper router, R1. It also informs the host
of the error by sending an ICMP Redirect message. The host is expected to adjust its
forwarding tables so that future datagrams to the same destination go through R1 without
bothering R2.


 0                            15 16                           31
+---------------+---------------+-------------------------------+
|   Type (5)    |   Code (1)    |           Checksum            |
+---------------+---------------+-------------------------------+
|         IPv4 Address That Should Be Used as Next Hop          |
+---------------------------------------------------------------+
|       As much of offending datagram as possible so that       |
|          resulting IP/ICMP datagram does not exceed           |
|                           576 bytes                           |
+---------------------------------------------------------------+


Figure 8-8 The ICMPv4 Redirect message includes the IPv4 address of the correct router to use as a
next hop for the datagram included in the payload portion of the message. A host
typically checks the IPv4 source address of the incoming Redirect message to verify that it is
coming from the default router it is currently using.


We can examine the behavior of a Redirect message by changing our host to
use an incorrect router (another host on the same network) as its default next hop.
As an example, we first change our default route and then attempt to contact a
remote server. Our system will mistakenly attempt to forward its outgoing
packets to the specified host:

C:\> netstat -rn
Network Dest    Netmask    Gateway       Interface      Metric
0.0.0.0         0.0.0.0    10.212.2.1    10.212.2.88    1

C:\> route delete 0.0.0.0                                  delete default

C:\> route add 0.0.0.0 mask 0.0.0.0 10.212.2.112          add new
C:\> ping dsl.eecs.berkeley.edu                            sends thru 10.212.2.112

Pinging dsl.eecs.berkeley.edu [169.229.60.105] with 32 bytes of data:

Reply from 169.229.60.105: bytes=32 time=1ms TTL=250
Reply from 169.229.60.105: bytes=32 time=5ms TTL=250
Reply from 169.229.60.105: bytes=32 time=1ms TTL=250
Reply from 169.229.60.105: bytes=32 time=1ms TTL=250

Ping statistics for 169.229.60.105:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 1ms, Maximum = 5ms, Average = 2ms


While this is taking place, we can run tcpdump to observe the activities (some 
lines have been wrapped for clarity): 

Linux# tcpdump host 10.212.2.88
1 20:27:00.759340 IP 10.212.2.88 > dsl.eecs.berkeley.edu: icmp 40:
        echo request seq 15616
2 20:27:00.759445 IP 10.212.2.112 > 10.212.2.88: icmp 68:
        redirect dsl.eecs.berkeley.edu to host 10.212.2.1
3 20:27:00.759468 IP 10.212.2.88 > dsl.eecs.berkeley.edu: icmp 40:
        echo request seq 15616


Here our host (10.212.2.88) sends an ICMPv4 Echo Request (ping) message
to the host dsl.eecs.berkeley.edu. After the name is resolved by DNS (see
Chapter 11) to the IPv4 address 169.229.60.105, the Request message is sent to the
first hop, 10.212.2.112, rather than the correct default router, 10.212.2.1. Because
the system with IPv4 address 10.212.2.112 is properly configured, it
understands that the original sending host should have used the router 10.212.2.1. As
expected, it responds with an ICMPv4 Redirect message toward the host,
indicating that in the future, any traffic destined for dsl.eecs.berkeley.edu should
go through the router 10.212.2.1.
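
The host-side handling of such a message can be sketched in Python as follows.
This is a minimal illustration under stated assumptions: the helper names
(parse_redirect, apply_redirect, addr) and the dictionary used as a host-route
table are hypothetical, checksums are not verified, and the sample message is
hand-built to mirror the addresses in the trace above.

```python
import ipaddress
import struct

ICMPV4_REDIRECT = 5  # ICMPv4 type value for Redirect


def parse_redirect(message: bytes):
    """Return (code, next_hop, destination) from an ICMPv4 Redirect message."""
    if len(message) < 8 + 20:
        raise ValueError("need the 8-byte ICMP header plus an IPv4 header")
    msg_type, code, _checksum, gateway = struct.unpack("!BBH4s", message[:8])
    if msg_type != ICMPV4_REDIRECT:
        raise ValueError(f"unexpected ICMPv4 type {msg_type}")
    # The offending datagram follows the ICMP header; its destination
    # address occupies bytes 16-19 of the embedded IPv4 header.
    offending = message[8:]
    destination = ipaddress.IPv4Address(offending[16:20])
    return code, ipaddress.IPv4Address(gateway), destination


def apply_redirect(host_routes, current_router, sender, message):
    """Install a host route only if the Redirect came from the router in use."""
    if sender != current_router:
        return False
    _code, next_hop, destination = parse_redirect(message)
    host_routes[destination] = next_hop
    return True


def addr(text):
    return ipaddress.IPv4Address(text)


# Hand-built sample: 10.212.2.112 redirects traffic for 169.229.60.105
# to 10.212.2.1. Checksums and most header fields are left at zero.
inner = bytearray(20)
inner[0] = 0x45  # IPv4, 20-byte header
inner[12:16] = addr("10.212.2.88").packed
inner[16:20] = addr("169.229.60.105").packed
redirect = struct.pack("!BBH4s", 5, 1, 0, addr("10.212.2.1").packed) + bytes(inner)

routes = {}
accepted = apply_redirect(routes, addr("10.212.2.112"), addr("10.212.2.112"), redirect)
print(accepted, routes)
```

The sender check reflects the verification described in the caption of
Figure 8-8: a Redirect arriving from any node other than the first-hop router
currently in use is ignored.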

In ICMPv6, the Redirect message (type 137) contains the target address and
the destination address (see Figure 8-9), and it is defined in conjunction with the
ND process (see Section 8.5). The Target Address field contains the correct node's
link-local IPv6 address that should be used for the next hop. The Destination
Address is the destination IPv6 address in the datagram that evoked the redirect.
In the particular situation where the destination is an on-link neighbor to the host
receiving the redirect, the Target Address and Destination Address fields are
identical. This provides a method for informing a host that another host is on the same
link, even if the two hosts do not share a common address prefix [RFC5942].


Figure 8-9 The ICMPv6 Redirect message. The target address indicates the IPv6 address of a better
next-hop router for the node identified by the destination address. This message can also
be used to indicate that the destination address is an on-link neighbor to the node
sending the message that induced the error message. In this case, the destination and target
addresses are the same.


As with other ND messages in ICMPv6, this message can include options. The
types of options include the Target Link-Layer Address option and the Redirected
Header option. The Target Link-Layer Address option is required in cases where
the Redirect message is used on a non-broadcast multiple access (NBMA) network,
because in such cases there may be no other efficient way for the host receiving
the Redirect message to determine the link-layer address for the new next hop.
The Redirected Header option holds a portion of the IPv6 packet that caused the
Redirect message to be generated. We discuss the format of these options and
others in Section 8.5 when exploring IPv6 Neighbor Discovery.
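
As a rough illustration of how the fixed part of an ICMPv6 Redirect and its
trailing options fit together, the Python sketch below walks a hand-built
message. It assumes the layout used by Neighbor Discovery (a 4-byte reserved
field after the checksum, then the two 16-byte addresses, then options whose
length byte counts units of 8 bytes, with option type 2 for Target Link-Layer
Address and type 4 for Redirected Header). The function name and the sample
addresses are hypothetical, and no checksum validation is attempted.

```python
import ipaddress
import struct

ICMPV6_REDIRECT = 137
OPTION_NAMES = {2: "Target Link-Layer Address", 4: "Redirected Header"}


def parse_icmpv6_redirect(message: bytes):
    """Return (target, destination, options) from an ICMPv6 Redirect message."""
    if len(message) < 40:
        raise ValueError("an ICMPv6 Redirect is at least 40 bytes long")
    msg_type, _code, _checksum, _reserved = struct.unpack("!BBHI", message[:8])
    if msg_type != ICMPV6_REDIRECT:
        raise ValueError(f"unexpected ICMPv6 type {msg_type}")
    target = ipaddress.IPv6Address(message[8:24])
    destination = ipaddress.IPv6Address(message[24:40])

    options = []
    offset = 40
    while offset + 2 <= len(message):
        opt_type, opt_len = message[offset], message[offset + 1]
        if opt_len == 0:
            raise ValueError("ND option with zero length")
        size = opt_len * 8  # option length is counted in 8-byte units
        body = message[offset + 2:offset + size]
        options.append((OPTION_NAMES.get(opt_type, f"type {opt_type}"), body))
        offset += size
    return target, destination, options


# Hand-built sample: better next hop fe80::1 for destination 2001:db8::10,
# carrying one Target Link-Layer Address option with a 6-byte MAC address.
mac = bytes.fromhex("0020e06a0a69")
sample = (
    struct.pack("!BBHI", 137, 0, 0, 0)
    + ipaddress.IPv6Address("fe80::1").packed
    + ipaddress.IPv6Address("2001:db8::10").packed
    + bytes([2, 1]) + mac
)
target, destination, options = parse_icmpv6_redirect(sample)
print(target, destination, options)
```

When the target and destination returned here are equal, the message is
signaling that the destination is an on-link neighbor rather than naming a
better router.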

8.3.4 ICMP Time Exceeded (ICMPv4 Type 11, ICMPv6 Type 3) 

Every IPv4 datagram has a Time-to-Live (TTL) field in its IPv4 header, and every
IPv6 datagram has a Hop Limit field in its header (see Chapter 5). As originally
conceived, the 8-bit TTL field was to hold the number of seconds a datagram was
allowed to remain active in the network before being forcibly discarded (a good
thing if forwarding loops are present). Because of an additional rule that said that
any router must decrement the TTL field by at least 1, combined with the fact that
datagram forwarding times grew to be small fractions of a second, the TTL field
has been used in practice as a limitation on the number of hops an IPv4 datagram
is allowed to take before it is discarded by a router. This usage was formalized and
ultimately adopted in IPv6. ICMP Time Exceeded (code 0) messages are generated
when a router discards a datagram because the TTL or Hop Limit field is too low
(i.e., arrives with value 0 or 1 and must be forwarded). This message is important
for the proper operation of the traceroute tool (called tracert on Windows).
Its format, for both ICMPv4 and ICMPv6, is given in Figure 8-10.


 0                            15 16                           31
+---------------+---------------+-------------------------------+
|     Type      |     Code      |                               |
|  (IPv4: 11,   |   (0 or 1)    |           Checksum            |
|   IPv6: 3)    |               |                               |
+---------------+---------------+-------------------------------+
|                          Unused (0)                           |
+---------------------------------------------------------------+
|       As much of offending datagram as possible so that       |
|       the resulting IPv4/ICMP datagram does not exceed        |
|              576 bytes or the minimum MTU (IPv6)              |
+---------------------------------------------------------------+


Figure 8-10 The ICMP Time Exceeded message format for ICMPv4 and ICMPv6. The message is
standardized for both the TTL or hop count being exceeded (code 0) or the time for
reassembling fragments exceeding some preconfigured threshold (code 1).


Another less common variant of this message is when a fragmented IP
datagram only partially arrives at its destination (i.e., all its fragments do not arrive
after a period of time). In such cases, a variant of the ICMP Time Exceeded
message (code 1) is used to inform the sender that its overall datagram has been
discarded. Recall that if any fragment of a datagram is dropped, the entire datagram
is lost.

8.3.4.1 Example: The traceroute Tool 

The traceroute tool is used to determine the routers used along a path from
a sender to a destination. We shall discuss the operation of the IPv4 version. The
approach involves sending datagrams first with an IPv4 TTL field set to 1 and
allowing the expiring datagrams to induce routers along the path to send ICMPv4
Time Exceeded (code 0) messages. Each round, the sending TTL value is increased
by 1, causing the routers that are one hop farther to expire the datagrams and
generate ICMP messages. These messages are sent from the router's primary IPv4
address "facing" the sender. Figure 8-11 shows how this approach works.
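
The TTL-stepping idea can be tried out without raw sockets by simulating it.
The Python sketch below is only a model, not the traceroute implementation:
the functions send_probe and trace are hypothetical, each router is represented
by a single address, and the path (including the router 192.0.2.1 and the
assumption that the destination is four hops away) is made up for illustration.

```python
def send_probe(path, ttl):
    """Model one probe: return (outcome, responder) for the given initial TTL.

    `path` lists the routers in order, followed by the destination host.
    """
    routers, destination = path[:-1], path[-1]
    for router in routers:
        # A router that must forward a datagram arriving with TTL 0 or 1
        # discards it and answers with ICMP Time Exceeded (code 0).
        if ttl <= 1:
            return "time-exceeded", router
        ttl -= 1
    return "reached", destination


def trace(path, max_hops):
    """Probe with TTL = 1, 2, ... and record who answers each round."""
    hops = []
    for ttl in range(1, max_hops + 1):
        outcome, responder = send_probe(path, ttl)
        hops.append((ttl, responder))
        if outcome == "reached":
            break
    return hops


# Hypothetical path: two routers seen in the example that follows, one
# invented router, then the destination host.
path = ["192.168.0.1", "10.0.0.1", "192.0.2.1", "128.32.244.172"]
print(trace(path, 2))
print(trace(path, 30))
```

Limiting the trace to two rounds, as with the -m 2 option in the example that
follows, reports only the first two routers; a larger limit continues until
the destination itself answers.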





Figure 8-11 The traceroute tool can be used to determine the routing path, assuming it does not
fluctuate too quickly. When using traceroute, routers are typically identified by the
IP addresses assigned to the interfaces "facing" or nearest to the host performing the
trace.

In this example, traceroute is used to send UDP datagrams (see Chapter
10) from the laptop to the host www.eecs.berkeley.edu (an Internet host with
IPv4 address 128.32.244.172, not shown in Figure 8-11). This is accomplished
using the following command:

Linux% traceroute -m 2 www.cs.berkeley.edu
traceroute to web2.eecs.berkeley.edu (128.32.244.172), 2 hops max,
        52 byte packets
 1 gw (192.168.0.1) 3.213 ms 0.839 ms 0.920 ms
 2 10.0.0.1 (10.0.0.1) 1.524 ms 1.221 ms 9.176 ms

The -m option instructs traceroute to perform only two rounds: one using
TTL = 1 and one using TTL = 2. Each line gives the information found at the
corresponding TTL. For example, line 1 indicates that one hop away a router with IPv4
address 192.168.0.1 was found and that three independent round-trip-time
measurements (3.213, 0.839, and 0.920ms) were taken. The difference between the
first and subsequent times relates to additional work that is involved in the first
measurement (i.e., an ARP transaction). Figures 8-12 and 8-13 show Wireshark
packet captures indicating how the outgoing datagrams and returning ICMPv4
messages are structured.


[Wireshark capture, file traceroute.tr. The packet list shows UDP datagrams from
192.168.0.93 to 128.32.244.172 (source port 35616, destination ports counting up
from 33435) and the ICMP "Time-to-live exceeded (Time to live exceeded in
transit)" messages returned to 192.168.0.93. Details of the first datagram:]

Ethernet II, Src: 00:17:f2:c7:ea:63, Dst: 00:20:e0:6a:0a:69
Internet Protocol, Src: 192.168.0.93, Dst: 128.32.244.172
    Version: 4
    Header length: 20 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
    Total Length: 52
    Identification: 0x8b21 (35617)
    Flags: 0x00
    Fragment offset: 0
    Time to live: 1
    Protocol: UDP (17)
    Header checksum: 0xf8c5 [correct]
    Source: 192.168.0.93
    Destination: 128.32.244.172
User Datagram Protocol, Src Port: 35616, Dst Port: 33435
Data (24 bytes)

Figure 8-12 traceroute using IPv4 begins by sending a UDP/IPv4 datagram with TTL = 1 to destination port
number 33435. Each TTL value is tried three times before being incremented by 1 and retried. Each
expiring datagram causes the router at the appropriate hop distance to send an ICMPv4 Time Exceeded
message back to the source. The message's source address is that of the router "facing" the sender.

Looking at Figure 8-12, we can see that traceroute sends six datagrams, and
that each datagram is sent to a destination port number in sequence, starting with
33435. If we look more closely, we can see that the first three datagrams are sent
with TTL = 1 and the second set of three are sent with TTL = 2. Figure 8-12 shows
the first one. Each datagram causes an ICMPv4 Time Exceeded (code 0) message
to be sent. The first three are sent from router N3 (IPv4 address 192.168.0.1), and
the next three are sent from router N2 (IPv4 address 10.0.0.1). Figure 8-13 shows the
last ICMP message in more detail.

Frame 12: 94 bytes on wire (752 bits), 94 bytes captured (752 bits)
Ethernet II, Src: 00:20:e0:6a:0a:69, Dst: 00:17:f2:c7:ea:63
Internet Protocol, Src: 10.0.0.1, Dst: 192.168.0.93
Internet Control Message Protocol
    Type: 11 (Time-to-live exceeded)
    Code: 0 (Time to live exceeded in transit)
    Checksum: 0x2b04 [correct]
    Internet Protocol, Src: 192.168.0.93, Dst: 128.32.244.172
        Version: 4
        Header length: 20 bytes
        Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
        Total Length: 52
        Identification: 0x8b26 (35622)
        Flags: 0x00
        Fragment offset: 0
        Time to live: 0
        Protocol: UDP (17)
        Header checksum: 0xf9c0 [correct]
        Source: 192.168.0.93
        Destination: 128.32.244.172
    User Datagram Protocol, Src Port: 35616, Dst Port: 33440
    Data (24 bytes)


Figure 8-13 The final ICMPv4 Time Exceeded message of the trace is sent by N2 (IPv4 address 
10.0.0.1). It includes a copy of the original datagram that caused the Time Exceeded 
message to be generated. The TTL of the inner IPv4 header is 0 because N2 decremented 
it from 1. 


This is the final Time Exceeded message of the trace. It contains the original
IPv4 datagram (packet 11), as seen by N2 upon receipt. This datagram arrives with
TTL = 1, but after being decremented is too small for N2 to perform additional
forwarding to 128.32.244.172. Consequently, N2 sends a Time Exceeded message
back to the source of the original datagram.

8.3.5 Parameter Problem (ICMPv4 Type 12, ICMPv6 Type 4)

ICMP Parameter Problem messages are generated by a host or router receiving
an IP datagram containing some problem in its IP header that cannot be repaired.
When a datagram cannot be handled and no other ICMP message adequately
describes the problem, this message acts as a sort of "catchall" error condition
indicator. In both ICMPv4 and ICMPv6, if there is an error in the header such that
some field is out of acceptable range, a special ICMP error message Pointer field
indicates the byte offset of the field where the error was found, relative to the
beginning of the offending IP header. With ICMPv4, for example, a value of 1 in
the Pointer field indicates a bad IPv4 DS Field or ECN field (together, these fields
used to be called the IPv4 Type of Service or ToS Byte, which has since been
redefined and renamed; see Chapter 5). The format of the ICMPv4 Parameter Problem
message is shown in Figure 8-14.


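 0                            15 16                           31
+---------------+---------------+-------------------------------+
|   Type (12)   |     Code      |           Checksum            |
|               | (0, 1, or 2)  |                               |
+---------------+---------------+-------------------------------+
|    Pointer    |                  Unused (0)                   |
+---------------+-----------------------------------------------+
|       As much of offending datagram as possible so that       |
|        the resulting IP/ICMP datagram does not exceed         |
|                           576 bytes                           |
+---------------------------------------------------------------+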


Figure 8-14 The ICMPv4 Parameter Problem message is used when no other message applies. The
Pointer field indicates the byte index of the problematic value in the offending IPv4
header. Code 0 is most common. Code 1 was formerly used to indicate that a required
option was missing but is now historic. Code 2 indicates that the offending IPv4
datagram has a bad IHL or Total Length field.

Code 0 is the most common variant of the ICMPv4 Parameter Problem
messages and is used when there is almost any problem with the IPv4 header, although
problems with the header or datagram Total Length fields may instead generate
code 2 messages. Code 1 was once used to indicate missing options such as
security labels on packets but is now historic. Code 2, a more recently defined code,
indicates a bad length in the IHL or Total Length fields (see Chapter 5). The ICMPv6
version of this error message is shown in Figure 8-15.

In ICMPv6, the treatment of this error has been refined somewhat, relative to
the ICMPv4 version, into three cases: erroneous header field encountered (code
0), unrecognized Next Header type encountered (code 1), and unrecognized IPv6
option encountered (code 2). As with the corresponding error message in ICMPv4,
the ICMPv6 parameter problem Pointer field gives the byte offset into the
offending IPv6 header that caused the problem. For example, a Pointer field of 40 would
indicate a problem with the first IPv6 extension header.


 0                            15 16                           31
+---------------+---------------+-------------------------------+
|   Type (4)    |     Code      |           Checksum            |
|               | (0, 1, or 2)  |                               |
+---------------+---------------+-------------------------------+
|                            Pointer                            |
+---------------------------------------------------------------+
|       As much of offending datagram as possible so that       |
|          the resulting IPv6/ICMPv6 datagram does not          |
|                    exceed the minimum MTU                     |
+---------------------------------------------------------------+



Figure 8-15 The ICMPv6 Parameter Problem message. The Pointer field gives the byte offset into 
the original datagram where an error was encountered. Code 0 indicates a bad header 
field. Code 1 indicates an unrecognized Next Header type, and Code 2 indicates that an 
unknown IPv6 option was encountered. 


The erroneous header (code 0) error occurs when a field in one of the IPv6
headers contains an illegal value. A code 1 error occurs when an IPv6 Next Header
(header chaining) field contains a value corresponding to a header type that the
IPv6 implementation does not support. Finally, code 2 is used when an IPv6 header
option is received but not recognized by the implementation.
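
To show how a receiver of a Parameter Problem message might interpret the
Pointer value, the Python sketch below maps a byte offset to a header region.
The table follows the standard fixed IPv4 header layout (20 bytes without
options) and the 40-byte fixed IPv6 header; the function names and the returned
strings are illustrative assumptions.

```python
# (first byte, last byte, field name) for the fixed 20-byte IPv4 header.
IPV4_FIELDS = [
    (0, 0, "Version/IHL"),
    (1, 1, "DS Field/ECN"),
    (2, 3, "Total Length"),
    (4, 5, "Identification"),
    (6, 7, "Flags/Fragment Offset"),
    (8, 8, "Time-to-Live"),
    (9, 9, "Protocol"),
    (10, 11, "Header Checksum"),
    (12, 15, "Source Address"),
    (16, 19, "Destination Address"),
]

IPV6_FIXED_HEADER_LEN = 40  # bytes


def ipv4_field_at(pointer: int) -> str:
    """Name the IPv4 header field that contains the given byte offset."""
    for first, last, name in IPV4_FIELDS:
        if first <= pointer <= last:
            return name
    return "IPv4 options or payload"


def ipv6_region_at(pointer: int) -> str:
    """Say whether an offset falls in the fixed IPv6 header or beyond it."""
    if pointer < IPV6_FIXED_HEADER_LEN:
        return "fixed IPv6 header"
    return "extension header or payload"


print(ipv4_field_at(1))    # the ICMPv4 example given in the text
print(ipv6_region_at(40))  # the ICMPv6 example given in the text
```

An offset of 40 lands on the first byte after the fixed IPv6 header, which is
why it points at the first extension header when one is present.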


8.4 ICMP Query/Informational Messages 

Although ICMP defines a number of query messages such as Address Mask
Request/Reply (types 17/18), Timestamp Request/Reply (types 13/14), and
Information Request/Reply (types 15/16), these functions have been replaced by other,
more purpose-specific protocols (including DHCP; see Chapter 6). The only
remaining popular ICMP query/informational messages are the Echo Request/
Response messages, more commonly called ping, and the Router Discovery
messages. Even the Router Discovery mechanism is not in wide use with IPv4, but its
analog (part of Neighbor Discovery) in IPv6 is fundamental. In addition, ICMPv6
has been extended to support Mobile IPv6 and the discovery of multicast-capable
routers. In this section, we investigate the Echo Request/Reply functions and the
messages used for basic router and Multicast Listener Discovery (also see
Chapters 6 and 9). In the subsequent section, we explore the operation of Neighbor
Discovery in IPv6.

8.4.1 Echo Request/Reply (ping) (ICMPv4 Types 0/8, ICMPv6 Types 129/128)

One of the most commonly used ICMP message pairs is Echo Request and Echo
Response (or Reply). In ICMPv4 these are types 8 and 0, respectively, and in
ICMPv6 they are types 128 and 129, respectively. ICMP Echo Request messages
may be of nearly arbitrary size (limited by the ultimate size of the encapsulating
IP datagram). With ICMP Echo Reply messages, the ICMP implementation is
required to return any data received back to the sender, even if multiple IP
fragments are involved. The ICMP Echo Request/Response message format is shown
in Figure 8-16.

As with other ICMP query/informational messages, the server must echo the
Identifier and Sequence Number fields back in the reply.


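 0                            15 16                           31
+---------------+---------------+-------------------------------+
|     Type      |               |                               |
|  (IPv4: 0/8;  |   Code (0)    |           Checksum            |
|IPv6: 128/129) |               |                               |
+---------------+---------------+-------------------------------+
|          Identifier           |        Sequence Number        |
+-------------------------------+-------------------------------+
|                        Data (Optional)                        |
+---------------------------------------------------------------+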


Figure 8-16 Format of the ICMPv4 and ICMPv6 Echo Request and Echo Reply messages. Any 
optional data included in a request must be returned in a reply. NATs use the Identifier 
field to match requests with replies, as discussed in Chapter 7. 
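
The Echo format is simple enough to build by hand. The Python sketch below
constructs an ICMPv4 Echo Request and the matching Echo Reply in memory, using
the standard Internet checksum (one's-complement sum of 16-bit words). It is an
illustration under stated assumptions: the function names, identifier, and
payload are made up, nothing is sent on the network, and the ICMPv6 case (whose
checksum also covers a pseudo-header) is not handled.

```python
import struct

ECHO_REQUEST, ECHO_REPLY = 8, 0  # ICMPv4 type values


def internet_checksum(data: bytes) -> int:
    """One's-complement of the one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input for the computation only
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def build_echo(msg_type: int, identifier: int, sequence: int, data: bytes = b"") -> bytes:
    """Build an Echo message: Type, Code (0), Checksum, Identifier, Sequence Number, Data."""
    header = struct.pack("!BBHHH", msg_type, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + data)
    return struct.pack("!BBHHH", msg_type, 0, checksum, identifier, sequence) + data


def build_reply(request: bytes) -> bytes:
    """Echo the Identifier, Sequence Number, and data back with type 0."""
    msg_type, _code, _checksum, identifier, sequence = struct.unpack("!BBHHH", request[:8])
    if msg_type != ECHO_REQUEST:
        raise ValueError("not an ICMPv4 Echo Request")
    return build_echo(ECHO_REPLY, identifier, sequence, request[8:])


request = build_echo(ECHO_REQUEST, 0x1234, 0, b"ping payload!")
reply = build_reply(request)
# Recomputing the checksum over a correctly checksummed message yields 0.
print(internet_checksum(request), internet_checksum(reply))
```

Only the Type and Checksum fields differ between the two messages; everything
from the Identifier onward is returned unchanged.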


These messages are sent by the ping program, which is commonly used to 
quickly determine if a computer is reachable on the Internet. At one time, if you 
could "ping" a host, you could almost certainly reach it by other means (remote 
login, other services, etc.). With firewalls in common use, however, this is now far 
from certain. 


Note 

The name ping is taken from the sonar operation to locate objects. The ping
program was written by Mike Muuss, who maintained an amusing Web page
describing its history [PING].


Implementations of ping set the Identifier field in the ICMP message to some
number that the sending host can use to demultiplex returned responses. In
UNIX-based systems, for example, the process ID of the sending process is
typically placed in the Identifier field. This allows the ping application to identify the
returned responses if there are multiple instances of ping running at the same
time on the same host, because the ICMP protocol does not have the benefit of
transport-layer port numbers. This field is often known as the Query Identifier field
when referring to firewall behavior (see Chapter 7).

When a new instance of the ping program is run, the Sequence Number field
starts with the value 0 and is increased by 1 every time a new Echo Request
message is sent. ping prints the sequence number of each returned packet,
allowing the user to see if packets are missing, reordered, or duplicated. Recall that IP
(and consequently ICMP) is a best-effort datagram delivery service, so any of these
three conditions can occur. ICMP does, however, include a data checksum not
provided by IP.

The ping program also typically includes a copy of the local time in the
optional data area of outgoing echo requests. This time, along with the rest of the
contents of the data area, is returned in an Echo Response message. The ping
program notes the current time when a response is received and subtracts the time
in the reply from the current time, giving an estimate of the RTT to reach the host
that was "pinged." Because only the original sender's notion of the current time is
used, this feature does not require any synchronization between the clocks at the
sender and receiver. A similar approach is used by the traceroute tool for its
RTT measurements.
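
The timestamp-in-payload technique can be sketched in a few lines of Python.
The encoding used here (an 8-byte floating-point value followed by zero
padding) and the helper names are assumptions chosen for brevity; real ping
implementations store the time in their own formats. The "remote host" is
simulated by handing the payload straight back.

```python
import struct
import time


def make_payload(sent_at: float, padding: int = 48) -> bytes:
    """Place the sender's clock reading at the start of the echo data area."""
    return struct.pack("!d", sent_at) + bytes(padding)


def rtt_from_echo(payload: bytes, received_at: float) -> float:
    """Subtract the echoed send time from the sender's current time."""
    (sent_at,) = struct.unpack("!d", payload[:8])
    return received_at - sent_at


payload = make_payload(time.monotonic())
echoed = payload  # the responder returns the data area unchanged
rtt = rtt_from_echo(echoed, time.monotonic())
print(f"{rtt * 1000:.3f} ms")
```

Both clock readings come from the sender, so the responder's clock never
enters the calculation.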

Early versions of the ping program operated by sending an Echo Request
message once per second, printing each returning echo reply. Newer
implementations, however, have increased the variability in output formats and behaviors.
On Windows, the default is to send four echo requests, one per second, print some
statistics, and exit; the -t option is required to allow the Windows ping
application to continue until stopped by the user. On Linux, the behavior is the traditional
one: the default is to run until interrupted by the user, sending an echo request
each second and printing any responses. Many other variants of ping have been
developed over the years, and there are several other standard options. With some
versions of the application, a large packet can be constructed to contain special
data patterns. This has been used to look for data-dependent errors in network
communications equipment.

In the following example, we send an ICMPv4 Echo Request to the subnet
broadcast address. This particular version of the ping application (Linux) requires
us to specify the -b flag to indicate that it is indeed our intention (and it gives us
a warning regarding this, because it can generate a substantial volume of network
traffic) to use the broadcast address:

Linux% ping -b 10.0.0.127
WARNING: pinging broadcast address
PING 10.0.0.127 (10.0.0.127) from 10.0.0.1 : 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=0 ttl=255 time=1.290 msec
64 bytes from 10.0.0.6: icmp_seq=0 ttl=64 time=1.853 msec (DUP!)
64 bytes from 10.0.0.47: icmp_seq=0 ttl=64 time=2.311 msec (DUP!)
64 bytes from 10.0.0.1: icmp_seq=1 ttl=255 time=382 usec
64 bytes from 10.0.0.6: icmp_seq=1 ttl=64 time=1.587 msec (DUP!)
64 bytes from 10.0.0.47: icmp_seq=1 ttl=64 time=2.406 msec (DUP!)
64 bytes from 10.0.0.1: icmp_seq=2 ttl=255 time=380 usec
64 bytes from 10.0.0.6: icmp_seq=2 ttl=64 time=1.573 msec (DUP!)
64 bytes from 10.0.0.47: icmp_seq=2 ttl=64 time=2.394 msec (DUP!)
64 bytes from 10.0.0.1: icmp_seq=3 ttl=255 time=389 usec
64 bytes from 10.0.0.6: icmp_seq=3 ttl=64 time=1.583 msec (DUP!)
64 bytes from 10.0.0.47: icmp_seq=3 ttl=64 time=2.403 msec (DUP!)

--- 10.0.0.127 ping statistics ---
4 packets transmitted, 4 packets received, +8 duplicates, 0% packet loss
round-trip min/avg/max/mdev = 0.380/1.545/2.406/0.765 ms

Here, 4 outgoing Echo Request messages are sent and we see 12 responses.
This behavior is typical of using the broadcast address: all receiving nodes are
compelled to respond. We therefore see the sequence numbers 0, 1, 2, and 3, but
for each one we see 3 responses. The (DUP!) notation indicates that an Echo Reply
has been received containing a Sequence Number field identical to a previously
received one. Observe that the TTL values are different (255 and 64), suggesting
that different kinds of computers are responding.
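
The duplicate accounting in this output can be reproduced with a small Python
sketch. It assumes, as a simplification, that a reply counts as a duplicate
whenever its sequence number has already been seen; the function name
count_replies is hypothetical, and the reply list is transcribed from the
output above.

```python
def count_replies(replies):
    """Return (unique, duplicates) for a list of (source, sequence) replies."""
    seen = set()
    unique = duplicates = 0
    for _source, sequence in replies:
        if sequence in seen:
            duplicates += 1  # same Sequence Number as an earlier reply
        else:
            seen.add(sequence)
            unique += 1
    return unique, duplicates


# Three responders answer each of the four broadcast requests.
responders = ("10.0.0.1", "10.0.0.6", "10.0.0.47")
replies = [(host, seq) for seq in range(4) for host in responders]
print(count_replies(replies))
```

With 12 replies to 4 requests, this yields 4 received packets and 8
duplicates, matching the statistics line.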

Note that this procedure (sending echo requests to the IPv4 broadcast address)
can be used to quickly populate the local system's ARP table (see Chapter 4).
Those systems responding to the Echo Request message form an Echo Reply
message directed at the sender of the request. When the reply is destined for a system
on the same subnet, an ARP request is issued looking for the link-layer address
of the originator of the request. In so doing, ARP is exchanged between every
responder and the request sender. This causes the sender of the Echo Request
message to learn the link-layer addresses of all the responders. In this example,
even if the local system had no link-layer address mappings for the addresses
10.0.0.1, 10.0.0.6, and 10.0.0.47, they would all be present in the ARP table
after the broadcast. Note that returning Echo Reply messages to requests sent to
the broadcast address is optional. By default, Linux systems return such replies
and Windows XP systems do not.

8.4.2 Router Discovery: Router Solicitation and Advertisement (ICMPv4 Types 9,10) 

In Chapter 6, we looked at how DHCP can be used for a host to acquire an IP 
address and learn about the existence of nearby routers. An alternative option we 
mentioned for learning about routers is called Router Discovery (RD). Although 
specified for configuring both IPv4 and IPv6 hosts, it is not widely used with IPv4 
because of widespread preference for DHCP. However, it is now specified for use 
in conjunction with Mobile IP, so we provide a brief description. The IPv6 version 
forms part of the IPv6 SLAAC function (see Chapter 6) and is logically part of 
IPv6 ND. Therefore, we shall return to discussing it in the broader context of ND 
in Section 8.5. 

Router Discovery for IPv4 is accomplished using a pair of ICMPv4
informational messages [RFC1256]: Router Solicitation (RS, type 10) and Router
Advertisement (RA, type 9). The advertisements are sent by routers in two ways. First, they
are periodically multicast on the local network (using TTL = 1) to the All Hosts
multicast address (224.0.0.1), and they are also provided to hosts on demand that
ask for them using RS messages. RS messages are sent using multicast to the All
Routers multicast address (224.0.0.2). The primary purpose of Router Discovery
is for a host to learn about all the routers on its local subnetwork, so that it can
choose a default route among them. It is also used to discover the presence of
routers that are willing to act as Mobile IP home agents. See Chapter 9 for details on
local network multicast. Figure 8-17 shows the ICMPv4 RA message format, which
includes a list of the IPv4 addresses that can be used by a host as a default router.


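Basic Router Discovery Message

 0                            15 16                           31
+---------------+---------------+-------------------------------+
|   Type (9)    |   Code (0)    |           Checksum            |
+---------------+---------------+-------------------------------+
|   Number of   | Address Entry |      Lifetime (seconds)       |
|   Addresses   |     Size      |                               |
+---------------+---------------+-------------------------------+
|                      Router Address [1]                       |
+---------------------------------------------------------------+
|                     Preference Level [1]                      |
+---------------------------------------------------------------+
|                      Router Address [2]                       |
+---------------------------------------------------------------+
|                     Preference Level [2]                      |
+---------------------------------------------------------------+
|                              ...                              |
+---------------------------------------------------------------+
|                      Router Address [N]                       |
+---------------------------------------------------------------+
|                     Preference Level [N]                      |
+---------------------------------------------------------------+

Router Discovery with Mobile IP Support and Extensions [RFC5944]

+---------------+---------------+-------------------------------+
|   Type = 16   |    Length     |        Sequence Number        |
+---------------+---------------+-------------------------------+
|     Registration Lifetime     |   Flags, Resv (0) (5 bits)    |
+-------------------------------+-------------------------------+
|                      Care-of Address [1]                      |
+---------------------------------------------------------------+
|                      Care-of Address [2]                      |
+---------------------------------------------------------------+
|                              ...                              |
+---------------------------------------------------------------+
|                      Optional Extensions                      |
+---------------------------------------------------------------+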


Figure 8-17 The ICMPv4 Router Advertisement message includes a list of IPv4 addresses of routers that can 
be used as default next hops. The preference level lets network operators arrange for some order¬ 
ing of preferences to be applied with respect to the list (higher is more preferred). Mobile IPv4 
[RFC5944] augments RA messages with extensions in order to advertise MIPv4 mobility agents 
and the prefix lengths of the advertised router addresses. 


In Figure 8-17, the Number of Addresses field gives the number of router address
blocks in the message. Each block contains an IPv4 address and accompanying
preference level. The Address Entry Size field gives the number of 32-bit words per
block (two in this case). The Lifetime field gives the number of seconds for which
the list of addresses should be considered valid. The preference level is a 32-bit
signed two's-complement integer for which higher values indicate greater preference.
The default preference level is 0; the special value 0x80000000 indicates an
address that should not be used as a valid default router.
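
The field layout just described can be sketched in Python. This is a minimal
illustration, not a reference implementation: the function name
parse_router_advertisement and the constant NOT_DEFAULT are hypothetical, and the
sketch assumes the ICMPv4 checksum has already been verified elsewhere.

```python
import ipaddress
import struct

# 0x80000000 read as a signed 32-bit integer: "do not use as a default router".
NOT_DEFAULT = -0x80000000


def parse_router_advertisement(msg: bytes):
    """Parse the basic portion of an ICMPv4 Router Advertisement (type 9).

    Returns (lifetime_seconds, usable_routers, offset_after_entries), where
    usable_routers is sorted so the most preferred router comes first.
    """
    msg_type, _code, _checksum, num_addrs, entry_size, lifetime = struct.unpack_from(
        "!BBHBBH", msg, 0
    )
    if msg_type != 9:
        raise ValueError("not a Router Advertisement")
    if entry_size < 2:
        raise ValueError("each entry needs an address and a preference level")

    routers = []
    offset = 8
    for _ in range(num_addrs):
        # Each entry starts with a 32-bit address and a signed 32-bit preference.
        addr, pref = struct.unpack_from("!4si", msg, offset)
        routers.append((ipaddress.IPv4Address(addr), pref))
        offset += 4 * entry_size  # Address Entry Size counts 32-bit words

    usable = [r for r in routers if r[1] != NOT_DEFAULT]
    usable.sort(key=lambda r: r[1], reverse=True)  # higher is more preferred
    return lifetime, usable, offset
```

The returned offset marks where any extensions (such as those used by Mobile IP)
would begin.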

RA messages are also used by Mobile IP [RFC5944] for a node to locate a
mobility (i.e., home and/or foreign) agent. Figure 8-17 depicts a Router Advertisement
message including a Mobility Agent Advertisement extension. This extension
follows the conventional RA information and includes a Type field with value
16 and a Length field giving the number of bytes in the extension area (not including
the Type and Length fields). Its value is equal to (6 + 4K), assuming that K care-of
addresses are included. The Sequence Number field gives the number of such
extensions produced by the agent since initialization. The Registration Lifetime field
gives the maximum number of seconds during which the sending agent is willing to accept
MIPv4 registrations (0xFFFF indicates infinity). There are a number of Flags bit
fields with the following meanings: R (registration required for MIP services), B
(agent is too busy to accept new registrations), H (agent is willing to act as home
agent), F (agent is willing to act as foreign agent), M (the minimum encapsulation
format [RFC2004] is supported), G (the agent supports GRE tunnels for encapsulated
datagrams), r (reserved zero), T (reverse tunneling [RFC3024] is supported),
U (UDP tunneling [RFC3519] is supported), X (registration revocation [RFC3543]
is supported), and I (foreign agent supports regional registration [RFC4857]).
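
As a rough sketch of how a receiver might decode this extension, the following
Python fragment unpacks the fixed fields and derives K from the Length field. The
function name and the returned dictionary keys are hypothetical. The flag bit order
(R first, in the most significant bit, followed by B, H, F, M, G, r, T, U, X, I and
5 reserved bits) is assumed from the order in which the flags are listed above and
should be checked against [RFC5944].

```python
import ipaddress
import struct

# Assumed bit order, most significant bit first; 5 reserved bits follow.
FLAG_NAMES = "RBHFMGrTUXI"


def parse_mobility_agent_ext(ext: bytes) -> dict:
    """Decode a Mobility Agent Advertisement extension (Type 16)."""
    ext_type, length, seq, reg_lifetime, flags = struct.unpack_from("!BBHHH", ext, 0)
    if ext_type != 16:
        raise ValueError("not a Mobility Agent Advertisement extension")
    # Length excludes the Type and Length bytes: 6 fixed bytes + 4 per care-of address.
    if length < 6 or (length - 6) % 4 != 0:
        raise ValueError("Length must equal 6 + 4K")
    count = (length - 6) // 4

    set_flags = {name for i, name in enumerate(FLAG_NAMES) if flags & (0x8000 >> i)}
    care_of = [
        ipaddress.IPv4Address(ext[8 + 4 * i : 12 + 4 * i]) for i in range(count)
    ]
    return {
        "sequence": seq,
        "registration_lifetime": reg_lifetime,
        "infinite_lifetime": reg_lifetime == 0xFFFF,
        "flags": set_flags,
        "care_of": care_of,
    }
```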

In addition to the Mobility Agent Advertisement extension, one other extension
has been designed to help mobile nodes. The Prefix-Lengths extension may
follow a Mobility Agent Advertisement extension and indicates the prefix length
of each corresponding router address provided in the base router advertisement.
The format is shown in Figure 8-18.

[Figure 8-18 layout: Type (19), Length, Prefix Length [1], Prefix Length [2], Prefix
Length [3], and so on through Prefix Length [N], each 8 bits wide. The extension
follows the Mobility Agent Advertisement extension.]

Figure 8-18 The ICMPv4 optional RA Prefix-Lengths extension gives the number of significant prefix bits for
each of the N router addresses present in the basic Router Advertisement portion of the message.
This extension follows the Mobility Agent Advertisement extension, if present.


In Figure 8-18, the Length field is set equal to N, the Number of Addresses field 
from the basic RA message. Each 8-bit Prefix Length field gives the number of bits 
in the corresponding Router Address field (see Figure 8-17) in use on the local
subnetwork. This extension can be used by a mobile node to help determine whether
it has moved from one network to another. Using algorithm 2 of [RFC5944], a 
mobile node may cache the set of prefixes available on a particular link. A move 
can be detected if the set of network prefixes has changed. 
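
The idea of caching prefixes and comparing them can be illustrated with a short
Python sketch. It is only a simplified reading of the comparison described above,
not the full algorithm from [RFC5944]; the function names advertised_prefixes and
has_moved are hypothetical.

```python
import ipaddress


def advertised_prefixes(router_addrs, prefix_lengths):
    """Combine each Router Address with its Prefix Length into a network prefix."""
    return {
        ipaddress.ip_network(f"{addr}/{length}", strict=False)
        for addr, length in zip(router_addrs, prefix_lengths)
    }


def has_moved(cached, current):
    """Report a move when the set of prefixes seen on the link has changed."""
    return cached != current
```

A mobile node would keep the set returned for the last advertisement it accepted
and compare it with the set computed from each new advertisement.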

8.4.3 Home Agent Address Discovery Request/Reply (ICMPv6 Types 144/145)

[RFC6275] defines four ICMPv6 messages used in support of MIPv6. Two of the
ICMPv6 messages are used for dynamic home agent address discovery, and the
other two are used for renumbering and mobile configuration. The Home Agent
Address Discovery Request message is used by an MIPv6 node when visiting a
new network to dynamically discover a home agent (see Figure 8-19).


[Figure 8-19 layout, 32 bits per row: Type (144), Code (0), Checksum; Identifier, Reserved (0).]


Figure 8-19 The MIPv6 Home Agent Address Discovery Request message contains an identifier that 
is returned in the response. It is sent to the Home Agents anycast address for the mobile 
node's home prefix. 


The message is sent to the MIPv6 Home Agents anycast address for its home 
prefix. The IPv6 source address is typically the care-of address—the address a 
mobile node has acquired on the network it is currently visiting (see Chapter 5). A 
Home Agent Address Discovery Reply message (see Figure 8-20) is sent by a
node willing to act as a home agent for the given node and its home prefix. 


[Figure 8-20 layout, 32 bits per row: Type (145), Code (0), Checksum; Identifier, Reserved (0);
Home Agent Address.]


Figure 8-20 The MIPv6 Home Agent Address Discovery Reply message contains the identifier from
the corresponding request and one or more addresses of a home agent willing to forward
packets for the mobile node.


The home agent address is provided directly to the mobile node's unicast 
address, which is most likely a care-of address. These messages are intended to 
handle cases where a mobile node's HA has changed while transitioning between 
networks. After reestablishing an appropriate HA, the mobile may initiate MIPv6 
binding updates (see Chapter 5). 

8.4.4 Mobile Prefix Solicitation/Advertisement (ICMPv6 Types 146/147)

The Mobile Prefix Solicitation message (see Figure 8-21) is used to solicit a routing 
prefix update from an HA when a node's home address is about to become invalid. 
The mobile includes a Home Address option (IPv6 Destination Options; see Chapter 5)
and protects the solicitation using IPsec (see Chapter 18).


[Figure 8-21 layout, 32 bits per row: Type (146), Code (0), Checksum; Identifier, Reserved (0).]


Figure 8-21 The MIPv6 Mobile Prefix Solicitation message is sent by a mobile node when away to 
request a home agent to provide a mobile prefix advertisement. 


The solicitation message includes a random value in the Identifier field, used
to match requests with replies. It is similar to a Router Solicitation message but is
sent to a mobile node's HA instead of to the local subnetwork. In the advertisement
form of this message (see Figure 8-22), the encapsulating IPv6 datagram
must include a type 2 routing header (see Chapter 5). The Identifier field contains
a copy of the identifier provided in the solicitation message. The M (Managed
Address) field indicates that hosts should use stateful address configuration and
avoid autoconfiguration. The O (Other) field indicates that information other than
the address is provided by a stateful configuration method. The advertisement
then contains one or more Prefix Information options.

[Figure 8-22 layout, 32 bits per row: Type (147), Code (0), Checksum; Identifier, M and O
flags, Reserved (0); Options (If Any).]


Figure 8-22 The MIPv6 Mobile Prefix Advertisement message. The Identifier field matches the
corresponding field in the solicitation. The M (Managed) flag indicates that the address is
provided by a stateful configuration mechanism. The O (Other) flag indicates that other
information beyond the address is supplied by stateful mechanisms.


The Mobile Prefix Advertisement message is designed to inform a traveling
mobile node that its home prefix has changed. This message is normally secured
using IPsec (see Chapter 18) in order to help a mobile node protect itself from
spoofed prefix advertisements. The Prefix Information option, which uses the
format described in [RFC4861], contains the prefix(es) the mobile node should use
for configuring its home address(es).

8.4.5 Mobile IPv6 Fast Handover Messages (ICMPv6 Type 154)

A variant of MIPv6, called FMIPv6, defines fast handovers [RFC5568]. It
specifies methods for improving the IP-layer handoff latency when a mobile node
moves from one network access point (AP) to another. This is accomplished by
predicting the routers and addressing information that will be used prior to the
handoff taking place. The protocol involves the discovery of so-called proxy routers,
which behave like routers a mobile is likely to encounter after it is handed off
to a new network. There are corresponding ICMPv6 Proxy Router Solicitation and
Advertisement messages (called RtSolPr and PrRtAdv, respectively). The basic format
of the RtSolPr and PrRtAdv messages is given in Figure 8-23.


[Figure 8-23 layout, 32 bits per row: Type (154), Code, Checksum; Subtype, Reserved (0),
Identifier; Options.]

Figure 8-23 The common ICMPv6 message type used for FMIPv6 messages. The Code and Subtype
fields give further information. Solicitation messages use code 0 and subtype 2 and may
include the sender's link-layer address and the link-layer address of its preferred next
access point (if known) as options. Advertisements use codes 0-5 and subtype 3. The different
code values indicate the presence of various options, whether the advertisement
was solicited, if the prefix or router information has changed, and the handling of DHCP.


A mobile node may have some information available regarding the addresses
or identifiers of APs it will use in the future (e.g., by "scanning" for 802.11 networks).
A RtSolPr message uses code 0 and subtype 2 and must contain at least
one option, the New Access Point Link-Layer Address option. This is used to indicate
which AP the mobile is requesting information about. The RtSolPr message
may also contain a Link-Layer Address option identifying the source, if known.
These options use the IPv6 ND option format, so we shall defer discussion of them
until we look at ND in detail.

8.4.6 Multicast Listener Query/Report/Done (ICMPv6 Types 130/131/132) 

Multicast Listener Discovery (MLD) [RFC2710][RFC3590] provides management of
multicast addresses on links using IPv6. It is similar to the IGMP protocol used by
IPv4, described in Chapter 9. That chapter deals with the operation of IGMP and the
use of these ICMPv6 messages in detail; here we describe the message formats that
constitute MLD (version 1), including the Multicast Listener Query, Report, and Done
messages. The basic format is given in Figure 8-24. These messages are sent with an
IPv6 Hop Limit field value of 1 and the Router Alert Hop-by-Hop IPv6 option.


[Figure 8-24 layout, 32 bits per row. Common MLD format (24 bytes): Type (130, 131, or 132),
Code (0), Checksum; Maximum Response Delay (ms), Reserved (0); Multicast (Group) Address
(128 bits).]

Figure 8-24 ICMPv6 MLD version 1 messages are all of this form. Queries (type 130) are either
general or multicast-address-specific. General queries ask hosts to report which multicast
addresses they have in use, and address-specific queries are used to determine
if a specific address is (still) in use. The maximum response delay gives the maximum
number of milliseconds a host may delay sending a report in response to a query. The
Multicast Address field is 0 for general queries and the multicast address in question
for address-specific queries. For Report (type 131) and Done messages (type 132), it includes
the address related to the report or what address is no longer of interest, respectively.


The main purpose of MLD is for multicast routers to learn the multicast
addresses used by the hosts on each link to which they are mutually attached.
MLDv2 (described in the next section) extends this capability by allowing hosts
to specify particular sources from which they wish (or do not wish) to receive traffic.
Two forms of MLD queries are sent by multicast routers: general queries and
multicast-address-specific queries. Generally, routers send the query messages and hosts
respond with reports, either in response to the queries, or unsolicited if a host's
multicast address membership changes.

The Maximum Response Delay field, nonzero only in queries, gives the maximum
number of milliseconds a host may delay sending a report in response
to a query. Because the multicast router need only know that at least one host is
interested in traffic destined for a particular multicast address (because link-layer
multicast support allows the router to not have to replicate the message for each
destination), nodes may intentionally and randomly delay their reports, suppressing
them entirely if they notice that another neighbor has responded already.
This field provides an upper bound on how long this delay may be. The Multicast
Address field is 0 for general queries and the address for which the router is
interested in reports otherwise. For MLD Report messages (type 131) and MLD
Done messages (type 132) it includes the address related to the report or what
address is no longer of interest, respectively.
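
To make the common 24-byte format concrete, here is a small Python sketch that
classifies an MLD version 1 message. The function name parse_mldv1 and the returned
labels are hypothetical, and the sketch assumes the ICMPv6 checksum (which covers an
IPv6 pseudo-header) has been validated separately.

```python
import ipaddress
import struct

MLD_TYPES = {130: "Query", 131: "Report", 132: "Done"}


def parse_mldv1(msg: bytes):
    """Return (kind, max_response_delay_ms, multicast_address) for an MLDv1 message."""
    if len(msg) < 24:
        raise ValueError("MLDv1 messages are 24 bytes long")
    msg_type, _code, _checksum, max_delay, _reserved = struct.unpack_from(
        "!BBHHH", msg, 0
    )
    if msg_type not in MLD_TYPES:
        raise ValueError("not an MLDv1 message type")
    group = ipaddress.IPv6Address(msg[8:24])

    kind = MLD_TYPES[msg_type]
    if kind == "Query":
        # A zero Multicast Address field distinguishes a general query.
        if group == ipaddress.IPv6Address("::"):
            kind = "General Query"
        else:
            kind = "Multicast-Address-Specific Query"
    return kind, max_delay, group
```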

8.4.7 Version 2 Multicast Listener Discovery (MLDv2) (ICMPv6 Type 143) 

[RFC3810] defines extensions to the MLD facility described in [RFC2710]. In particular,
it defines a way for a multicast listener to specify a desire to hear from only one
specific set of senders (or, alternatively, to exclude one specific set). It is therefore
useful in supporting source-specific multicast (SSM; see Chapter 9 and [RFC4604]
[RFC4607]). It is basically a translation of the IGMPv3 protocol used with IPv4 for use
with IPv6, which uses ICMPv6 for most multicast address management. Therefore,
we will describe the message format here, but the detailed operation of multicast
address dynamics is covered in Chapter 9. MLDv2 extends the MLD Query message
with additional information pertaining to specific sources (see Figure 8-25). The
first 24 bytes of the message are identical to the common MLD format.

The Maximum Response Code field specifies the maximum time allowed before
sending an MLD Report message in response. The value of this field is interpreted
slightly differently than in MLDv1: if it is less than 32,768, the maximum
response delay is set equal to the value (in milliseconds) as in MLDv1. If the
value is equal to or greater than 32,768, the field encodes a floating-point number
using the format shown in Figure 8-26.

In this case, the maximum response delay is set equal to ((mant | 0x1000) <<
(exp + 3)) ms. The reason for this seemingly complex encoding strategy is to allow
small and large values of the response delay to be encoded in this field and retain
some compatibility with MLDv1. In particular, it allows for carefully adjusting the
leave latency and affecting the report burstiness (see Chapter 9).
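
The decoding rule can be written out as a short Python function. This is a sketch
that follows the description above, assuming the 16-bit code consists of a leading
1 bit, a 3-bit exp, and a 12-bit mant; the function name max_response_delay_ms is
hypothetical.

```python
def max_response_delay_ms(code: int) -> int:
    """Convert an MLDv2 Maximum Response Code into a delay in milliseconds."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("Maximum Response Code is a 16-bit field")
    if code < 32768:
        return code  # small values are used directly, as in MLDv1
    exp = (code >> 12) & 0x7   # 3 bits following the leading 1 bit
    mant = code & 0x0FFF       # low-order 12 bits
    return (mant | 0x1000) << (exp + 3)
```

One consequence of the encoding is that the smallest floating-point code, 0x8000,
decodes to 32,768 ms, so the two ranges join without a gap.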

The Multicast Address field is set to 0 for a general query. For a multicast-address-specific
query or multicast-address- and source-specific query it is set
to the multicast address being queried. The S field indicates whether router-side
processing should be suppressed. When set, it indicates to any receiving multicast
router that it must suppress the normal timer updates computed when hearing a
query. It does not indicate that querier election or normal "host-side" processing
should be suppressed if the router is itself a multicast listener.

The QRV (Querier Robustness Variable) field, if set, contains a value of no more
than 7. If the sender's internal QRV value exceeds 7, this field is set to 0. Robustness
variables, described in Chapter 9, are used to fine-tune the rate of MLD updates
based on an expectation of packet loss on a subnetwork. The QQIC (Querier's
Query Interval Code) field encodes the query interval and is shown in Figure 8-27.

The query interval, measured in seconds, is computed from the QQIC field as follows:
if QQIC < 128, then QQI = QQIC; otherwise, QQI = ((mant | 0x10) << (exp + 3)).
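
A minimal Python sketch of this computation follows, assuming the 8-bit code holds
a leading 1 bit, a 3-bit exp, and a 4-bit mant. The function name
query_interval_seconds is hypothetical.

```python
def query_interval_seconds(qqic: int) -> int:
    """Convert an MLDv2 QQIC value into the Querier's Query Interval (seconds)."""
    if not 0 <= qqic <= 0xFF:
        raise ValueError("QQIC is an 8-bit field")
    if qqic < 128:
        return qqic  # small values are used directly
    exp = (qqic >> 4) & 0x7  # 3 bits following the leading 1 bit
    mant = qqic & 0x0F       # low-order 4 bits
    return (mant | 0x10) << (exp + 3)
```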

The Number of Sources (N) field indicates the number of source addresses
present in the query. This field contains 0 for a general query or for a multicast-address-specific
query. It is nonzero for multicast-address- and source-specific
query messages.

[Figure 8-25 layout, 32 bits per row. Common MLD format (24 bytes): Type (130), Code (0),
Checksum; Maximum Response Code, Reserved (0); Multicast (Group) Address (128 bits).
MLDv2 additions: Resv (0) (4 bits), S, QRV (3 bits), QQIC (8 bits), Number of Sources (N)
(16 bits); Source Address [1] through Source Address [N].]

Figure 8-25 The MLDv2 Query message format, which is compatible with the MLD version 1 message
common format. The major difference is the capability to limit or exclude specific
multicast sources from the host's list of interests.


[Figure 8-26 layout, 16 bits numbered 0 through F: bit 0 is 1, bits 1-3 hold exp, and
bits 4-F hold mant.]

Figure 8-26 Floating-point format used with MLDv2 Query messages when the Maximum Response Code
value is at least 32,768. In these cases, the delay is set to ((mant | 0x1000) << (exp + 3)) ms.

[Figure 8-27 layout, 8 bits numbered 0 through 7: bit 0 is 1, bits 1-3 hold exp, and
bits 4-7 hold mant.]

Figure 8-27 The MLDv2 Querier's Query Interval Code encodes the interval between MLDv2 queries.
The (unencoded) version of this value is called the Querier's Query Interval and is measured
in seconds. The QQI is computed as follows: QQI = QQIC (if QQIC < 128) and QQI
= ((mant | 0x10) << (exp + 3)) otherwise.

The multicast address records used in the MLDv2 reports (see Figures 8-28
and 8-29) contain indicators of modifications to the source address filter being
used by an IPv6 node (see Chapter 9 on multicast for more information on the
operation of such filters, which describe sets of sending hosts that are or are not of
interest to a particular receiving host).



[Figure 8-28 layout: only the final entry, Multicast Address Record [M], is recoverable.]

Figure 8-28 The MLDv2 Report message includes a vector of multicast address records.

Figure 8-29 A multicast address (group) record. Multiple such records may be present in an MLDv2
Report message. The Record Type field is one of the following: MODE_IS_INCLUDE,
MODE_IS_EXCLUDE, CHANGE_TO_INCLUDE_MODE, CHANGE_TO_EXCLUDE_MODE,
ALLOW_NEW_SOURCES, or BLOCK_OLD_SOURCES. LW-MLDv2 simplifies
MLDv2 by removing the EXCLUDE modes. The Aux Data Len field contains the amount
of auxiliary data present in the record, in 32-bit-word units. For MLDv2, as specified in
[RFC3810], this field must contain the value 0, indicating no auxiliary data.








ICMPv4 and ICMPv6: Internet Control Message Protocol 


The record types fall into three primary categories: current state records, filter
mode change records, and source list change records. The first category includes
the MODE_IS_INCLUDE (IS_IN) and MODE_IS_EXCLUDE (IS_EX) types, which
indicate that the filter mode for the address is "include" or "exclude," respectively,
for the specified sources (at least one of which must be present). The filter mode
change types, CHANGE_TO_INCLUDE_MODE (TO_IN) and CHANGE_TO_EXCLUDE_MODE
(TO_EX), are similar to the current state records but are sent when there is a
change and need not include a nonempty source list. The source list change types,
ALLOW_NEW_SOURCES (ALLOW) and BLOCK_OLD_SOURCES (BLOCK),
are used when the filter state (include/exclude) is not changed but only the list
of sources is modified. A modification to MLDv2 (and IGMPv3) removes the
EXCLUDE modes in order to simplify the operation of MLDv2 [RFC5790]. This
"lightweight" approach, called LW-MLDv2 (and LW-IGMPv3), uses the same
previously defined message formats but removes support for the seldom-used
EXCLUDE directives that require multicast routers to keep additional state.

8.4.8 Multicast Router Discovery (MRD) (IGMP Types 48/49/50, ICMPv6 Types 151/152/153)

[RFC4286] describes Multicast Router Discovery (MRD), a method defining special
messages that can be used with ICMPv6 and IGMP to discover the presence of
routers capable of forwarding multicast packets and some of their configuration
parameters. It is envisioned primarily for use in conjunction with "IGMP/MLD
snooping." IGMP/MLD snooping is a mechanism by which systems other than
hosts and routers (e.g., layer 2 switches) can also learn about the location of network
layer multicast routers and interested hosts. We discuss it in more detail in
the context of IGMP in Chapter 9. MRD messages are always sent with the IPv4
TTL or IPv6 Hop Limit field set to 1 with a Router Alert option and may be one of
the following types: Advertisement (151), Solicitation (152), or Termination (153).
Advertisements are sent periodically at a configured interval to indicate a router's
willingness to forward multicast traffic. The Termination message indicates the
cessation of such willingness. Solicitation messages may be used to induce routers
to produce Advertisement messages. The Advertisement message format is shown
in Figure 8-30.

The Advertisement message is sent from the router's IP address (a link-local
address for IPv6) to the All Snoopers IP address: 224.0.0.106 for IPv4 and the link-local
multicast address ff02::6a for IPv6. A receiver is able to learn the router's
advertising interval and MLD parameters (QQI and QRV, described in more detail
in Chapter 9). Note that the QQI value is the query interval (in seconds), and not
the QQIC (encoded version of the QQI value) as previously described for MLDv2
queries.

The formats of Solicitation and Termination messages are nearly the same (see
Figure 8-31), differing only in the value of the Type field.

[Figure 8-30 layout, 32 bits per row: Type (IPv4: 0x30; IPv6: 151), Adv Interval (seconds),
Checksum; Query Interval (QQI), Robustness Variable.]

Figure 8-30 The MRD Advertisement message (ICMPv6 type 151; IGMP type 48) contains the 
advertisement interval (in seconds) indicating how often unsolicited advertisements 
are sent, the sender's query interval (QQI), and the robustness variable as defined by 
MLD. The IP address of the sender is used to indicate to a receiver the router that is able 
to forward multicast traffic. The message is sent to the All Snoopers multicast address 
(IPv4, 224.0.0.106; IPv6, ff02::6a). 


[Figure 8-31 layout, 32 bits per row: Type (IPv4: 0x31, 0x32; IPv6: 152, 153), Reserved (0),
Checksum.]



Figure 8-31 The ICMPv6 MRD Solicitation (ICMPv6 type 152; IGMP type 49) and Termination 
(ICMPv6 type 153; IGMP type 50) messages use a common format. MRD messages set 
the IPv6 Hop Limit field or IPv4 TTL field to 1 and include the Router Alert option. 
Solicitations are sent to the All Routers multicast address (IPv4, 224.0.0.2; IPv6, ff02::2). 


Figure 8-31 shows the (nearly) common format used for Solicitation and Termination
messages. The Solicitation message induces a multicast router to send
an Advertisement message on demand. Such messages are sent to the All Routers
address: 224.0.0.2 for IPv4 and the link-local multicast address ff02::2 for IPv6.
Termination messages are sent to the All Snoopers IP address to indicate that the
sending router is no longer willing to forward multicast traffic.


8.5 Neighbor Discovery in IPv6 

The Neighbor Discovery Protocol in IPv6 (sometimes abbreviated as NDP or
ND) [RFC4861] brings together the Router Discovery and Redirect mechanisms
of ICMPv4 with the address-mapping capabilities provided by ARP. It is also
specified for use in supporting Mobile IPv6. In contrast to ARP and IPv4, which
generally use broadcast addressing (except for Router Discovery), ICMPv6 makes
extensive use of multicast addressing, at both the network and link layers. (Recall
from Chapters 2 and 5 that IPv6 does not even have broadcast addresses.)

ND is designed to allow nodes (routers and hosts) on the same link or segment
to find each other, determine if they have bidirectional connectivity, and
determine if a neighbor has become inoperative or unavailable. It also supports
stateless address autoconfiguration (see Chapter 6). All of the ND functionality is
provided by ICMPv6 at or above the network layer, making it largely independent
of the particular link-layer technology employed underneath. However, ND does
prefer to make use of link-layer multicast capabilities (see Chapter 9), and for this
reason operation on non-broadcast- and non-multicast-capable link layers (called
non-broadcast multiple access or NBMA links) may differ somewhat.

The two main parts of ND are Neighbor Solicitation/Advertisement (NS/NA),
which provides the ARP-like function of mapping between network- and link-layer
addresses, and Router Solicitation/Advertisement (RS/RA), which provides
the functions of router discovery, Mobile IP agent discovery, and redirects, as
well as some support for autoconfiguration. A secure variant of ND called SEND
[RFC3971] adds authentication and special forms of addressing, primarily by
introducing additional ND options.

ND messages are ICMPv6 messages sent using an IPv6 Hop Limit field value
of 255. Receivers verify that incoming ND messages have this value to protect
against off-link senders that may attempt to spoof local ICMPv6 messages (such
messages would arrive with values less than 255). ND has a rich set of options that
messages may carry. First we discuss the primary message types and then detail
the available options.
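
The receiver-side check is simple enough to show in a few lines of Python. This is
only an illustrative sketch: the function name accept_nd_message is hypothetical,
and the set of type numbers (133 through 137, where 137 is the ICMPv6 Redirect
message) is supplied here as an assumption about which messages count as ND.

```python
# RS (133), RA (134), NS (135), NA (136), and Redirect (137).
ND_MESSAGE_TYPES = {133, 134, 135, 136, 137}


def accept_nd_message(icmpv6_type: int, hop_limit: int) -> bool:
    """Accept an ND message only if it cannot have been forwarded by a router."""
    if icmpv6_type not in ND_MESSAGE_TYPES:
        raise ValueError("not a Neighbor Discovery message")
    # Every router hop decrements the Hop Limit, so an off-link sender
    # cannot make a packet arrive with the maximum value of 255.
    return hop_limit == 255
```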

8.5.1 ICMPv6 Router Solicitation and Advertisement (ICMPv6 Types 133,134) 

Router Advertisement (RA) messages indicate the presence and capabilities of a 
nearby router. They are sent periodically by routers, or in response to a Router 
Solicitation (RS) message. The RS message (see Figure 8-32) is used to induce 
on-link routers to send RA messages. RS messages are sent to the All Routers 
multicast address, ff02::2. A Source Link-Layer Address option is supposed to be 
included if the sender of the message is using an IPv6 address other than the 
unspecified address (used during autoconfiguration). It is the only valid option
for such messages as of [RFC4861]. 


[Figure 8-32 layout, 32 bits per row: Type (133), Code (0), Checksum; Reserved (0); Options.]

Figure 8-32 The ICMPv6 Router Solicitation message is very simple but ordinarily contains a Source
Link-Layer Address option (unlike its ICMPv4 counterpart). It may also contain an
MTU option if an unusual MTU value is in use on the link.

The Router Advertisement (RA) message (see Figure 8-33) is sent by routers to
the All Nodes multicast address (ff02::1) or the unicast address of the requesting
host, if the advertisement is sent in response to a solicitation. RA messages inform
local hosts and other routers of configuration details relevant to the local link.

[Figure 8-33 layout, 32 bits per row: Type (134), Code (0), Checksum; Current Hop Limit,
flag byte (M, O, H, Pref, P), Router Lifetime; Reachable Time; Retransmission Timer; Options.]

Figure 8-33 An ICMPv6 Router Advertisement message is sent to the All Nodes multicast address
(ff02::1). Receiving nodes check to make sure that the Hop Limit field is 255, ensuring
that the packet has not been forwarded through a router. The message includes three
flags: M (Managed address configuration), O (Other stateful configuration), and H
(Home Agent).


The Current Hop Limit field specifies the default hop limit hosts are supposed
to use for sending IPv6 datagrams. A value of 0 indicates that the sending router
does not care. The next byte contains a number of bit fields, as summarized and
extended in [RFC5175]. The M (Managed) field indicates that the local assignment
of IPv6 addresses is handled by stateful configuration, and that hosts should avoid
using stateless autoconfiguration. The O (Other) field indicates that other stateful
information (that is, other than IPv6 addresses) uses a stateful configuration
mechanism (see Chapter 6). The H (Home Agent) field indicates that the sending
router is willing to act as a home agent for Mobile IPv6 nodes. The Pref (Preference)
field gives the level of preference for the sender of the message to be used
as a default router as follows: 01, high; 00, medium (default); 11, low; 10, reserved
(not used). More details about this field are given in [RFC4191]. The P (Proxy) flag
is used in conjunction with the experimental ND proxy facility [RFC4389]. It provides
a proxy-ARP-like capability (see Chapter 4) for IPv6.
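
The bit fields in this byte can be decoded as in the following Python sketch. The
bit positions (M, O, H, two Pref bits, P, then two unused bits, starting from the
most significant bit) are assumed from the order of the fields in Figure 8-33 and
should be confirmed against [RFC5175]. The function name decode_ra_flags and the
dictionary keys are hypothetical.

```python
# Pref is a 2-bit code, as listed in the text above.
PREFERENCE = {0b01: "high", 0b00: "medium", 0b11: "low", 0b10: "reserved"}


def decode_ra_flags(flags: int) -> dict:
    """Decode the flag byte that follows Current Hop Limit in an ICMPv6 RA."""
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit in one byte")
    return {
        "managed": bool(flags & 0x80),      # M: stateful address configuration
        "other": bool(flags & 0x40),        # O: other stateful information
        "home_agent": bool(flags & 0x20),   # H: willing to act as home agent
        "preference": PREFERENCE[(flags >> 3) & 0x3],
        "proxy": bool(flags & 0x04),        # P: experimental ND proxy
    }
```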

The Router Lifetime field indicates the amount of time during which the sending
router can be used as a default next hop, in seconds. If it is set to 0, the sending
router should never be used as a default router. This field applies only to the use of
the sending router as a default router; it does not affect other options carried in the
same message. The Reachable Time field gives the number of milliseconds in which
a node is to assume that another is reachable, assuming mutual communications
have taken place. This is used by the Neighbor Unreachability Detection mechanism
(see Section 8.5.4). The Retransmission Timer field dictates the time, in milliseconds,
during which hosts delay sending successive ND messages.

This message usually includes the Source Link-Layer Address option (if applicable)
and should include an MTU option if variable-length MTUs are used on the link.
The router should also include Prefix Information options that indicate which
IPv6 prefixes are in use on the local link. Chapter 6 includes an example of how
RS and RA messages are used (e.g., see Figures 6-24 and 6-25).

8.5.2 ICMPv6 Neighbor Solicitation and Advertisement (ICMPv6 Types 135,136)

The Neighbor Solicitation (NS) message in ICMPv6 (see Figure 8-34) effectively
replaces the ARP Request messages used with IPv4. Its primary purpose is to convert
IPv6 addresses to link-layer addresses. However, it is also used for detecting
whether nearby nodes can be reached, and if they can be reached bidirectionally
(that is, whether the nodes can talk to each other). When used to determine address
mappings, it is sent to the Solicited-Node multicast address corresponding to the
IPv6 address contained in the Target Address field (prefix ff02::1:ff00:0/104, combined
with the low-order 24 bits of the solicited IPv6 address). For more details on how
Solicited-Node multicast addressing is used, see Chapter 9. When this message
is used to determine connectivity to a neighbor, it is sent to that neighbor's IPv6
unicast address instead of the Solicited-Node address.
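
The construction of a Solicited-Node multicast address can be sketched with
Python's standard ipaddress module. The function name solicited_node_address is
hypothetical; the prefix ff02::1:ff00:0/104 is the one given in the text above.

```python
import ipaddress

SOLICITED_NODE_PREFIX = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node_address(target: str) -> ipaddress.IPv6Address:
    """Combine the /104 prefix with the low-order 24 bits of the target address."""
    low24 = int(ipaddress.IPv6Address(target)) & 0xFFFFFF
    return ipaddress.IPv6Address(SOLICITED_NODE_PREFIX | low24)
```

For example, a target of 2001:db8::1234:5678 maps to ff02::1:ff34:5678, so only
nodes whose addresses end in the same 24 bits need to process the solicitation.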


0 1516 31 


Type (135) Code (0) Checksum 



Reserved (0) 

Target Address 
(IPv6 Address Being Solicited) 

Options (If Any) 


Figure 8-34 The ICMPv6 Neighbor Solicitation message is similar to the RS message but contains 
a target IPv6 address. These messages are sent to Solicited-Node multicast addresses 
to provide ARP-like functionality and to unicast addresses to test reachability to other 
nodes. NS messages contain a Source Link-Layer Address option on links that use 
lower-layer addressing. 

The NS message contains the IPv6 address for which the sender is trying 
to learn the link-layer address. The message may contain the Source Link-Layer 
Address option. This option must be included in networks that use link-layer 
addressing when the solicitation is sent to a multicast address and should be 
included for unicast solicitations. If the sender of the message is using the 
unspecified address as its source address (e.g., during duplicate address detection), 
this option is not to be included. 

The ICMPv6 Neighbor Advertisement (NA) message (see Figure 8-35) serves 
the purpose of the ARP Response message in IPv4 in addition to helping with 
neighbor unreachability detection (see Section 8.5.4). It is either sent as a response 
to an NS message or sent asynchronously when a node's IPv6 address changes. It is 
sent either to the unicast address of the soliciting node, or to the All Nodes multicast 
address if the soliciting node used the unspecified address as its source address. 


 0                15 16               31 
 Type (136) | Code (0) | Checksum 
 R | S | O | Reserved (0) 
 Target Address (IPv6 Address Being Referenced) 
 Options (If Any) 


Figure 8-35 The ICMPv6 Neighbor Advertisement message contains the following flags: R indicates 
that the sender is a router, S indicates that the advertisement is a response to a solicitation, 
and O indicates that the message contents should override other cached address 
mappings. The Target Address field contains the IPv6 address of the sender of the message 
(generally, the unicast address of the solicited node from the ND solicitation). A 
Target Link-Layer Address option is included to enable ARP-like functionality for IPv6. 

The R (Router) field indicates that the sender of the message is a router. This 
could change, for example, if a router ceases being a router and becomes only a 
host instead. The S (Solicited) field indicates that the advertisement is in response to 
a solicitation received earlier. This field is used to verify that bidirectional 
connectivity between neighbors has been achieved. The O (Override) field indicates that 
information in the advertisement should override any previously cached 
information the receiver of the message has. It is not supposed to be set in solicited 
advertisements for anycast addresses or in solicited proxy advertisements. It is 
supposed to be set in other (solicited or unsolicited) advertisements. 
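
As a rough illustration of the layout in Figure 8-35, the following sketch decodes the fixed part of an NA message. The function name and the returned dictionary keys are our own choices; the bit positions assume that R, S, and O occupy the three most significant bits of the 32-bit word following the checksum, as in the figure.

```python
import ipaddress
import struct

def parse_neighbor_advertisement(msg: bytes) -> dict:
    """Decode the fixed 24-byte part of an ICMPv6 NA message (type 136).

    msg starts at the ICMPv6 Type field; any options after byte 24 are
    returned undecoded.
    """
    if len(msg) < 24:
        raise ValueError("NA message too short")
    msg_type, code, checksum, flags = struct.unpack("!BBHI", msg[:8])
    if msg_type != 136:
        raise ValueError("not a Neighbor Advertisement")
    return {
        "router": bool(flags & 0x80000000),     # R flag
        "solicited": bool(flags & 0x40000000),  # S flag
        "override": bool(flags & 0x20000000),   # O flag
        "target": ipaddress.IPv6Address(msg[8:24]),
        "options": msg[24:],
    }
```

A message whose flags word is 0x60000000 would decode with S and O set and R clear, the combination a host normally uses when answering a multicast solicitation for its own unicast address.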

For solicited advertisements, the Target Address field is the IPv6 address being 
looked up. For unsolicited advertisements, it is the IPv6 address that corresponds to a 
link-layer address that has changed. This message must contain the Target Link-Layer 
Address option on networks that support link-layer addressing when the 
advertisement was solicited via a multicast address. We will now look at a simple example. 

8.5.2.1 Example 

Here we see the results of using ICMPv6 Echo Request/Reply, in conjunction with 
NDP. The sender is a Windows XP system with IPv6 enabled, and a packet trace 
is captured on a nearby Linux system. Some lines have been wrapped for clarity. 


C:\> ping6 -s fe80::210:18ff:fe00:100b fe80::211:11ff:fe6f:c603 

Pinging fe80::211:11ff:fe6f:c603 
from fe80::210:18ff:fe00:100b with 32 bytes of data: 

Reply from fe80::211:11ff:fe6f:c603: bytes=32 time<1ms 
Reply from fe80::211:11ff:fe6f:c603: bytes=32 time<1ms 
Reply from fe80::211:11ff:fe6f:c603: bytes=32 time<1ms 
Reply from fe80::211:11ff:fe6f:c603: bytes=32 time<1ms 

Ping statistics for fe80::211:11ff:fe6f:c603: 
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss), 
Approximate round trip times in milli-seconds: 
    Minimum = 0ms, Maximum = 0ms, Average = 0ms 

Linux# tcpdump -i eth0 -s1500 -vv -p ip6 
tcpdump: listening on eth0, 
    link-type EN10MB (Ethernet), capture size 1500 bytes 

1 21:22:01.389656 fe80::211:11ff:fe6f:c603 > ff02::1:ff00:100b: 
    [icmp6 sum ok] icmp6: neighbor sol: who has 
    fe80::210:18ff:fe00:100b 
    (src lladdr: 00:11:11:6f:c6:03) 
    (len 32, hlim 255) 

2 21:22:01.389845 fe80::210:18ff:fe00:100b > fe80::211:11ff:fe6f:c603: 
    [icmp6 sum ok] icmp6: neighbor adv: tgt is 
    fe80::210:18ff:fe00:100b(SO) 
    (tgt lladdr: 00:10:18:00:10:0b) 
    (len 32, hlim 255) 

3 21:22:02.390713 fe80::210:18ff:fe00:100b > fe80::211:11ff:fe6f:c603: 
    [icmp6 sum ok] icmp6: echo request seq 18 
    (len 40, hlim 128) 

4 21:22:02.390780 fe80::211:11ff:fe6f:c603 > fe80::210:18ff:fe00:100b: 
    [icmp6 sum ok] icmp6: echo reply seq 18 
    (len 40, hlim 64) 


... continues ... 

The ping6 program is available on Windows XP and Linux. (Later versions 
of Windows incorporate the IPv6 functionality into the regular ping program.) 
The -s option tells it which source address to use. Recall that with IPv6 a host 
may have multiple addresses from which to choose, and here we have chosen one 
of its link-local addresses, fe80::210:18ff:fe00:100b. The trace shows the NS/NA 
exchange and an ICMPv6 Echo Request/Reply pair. Observe that all of the ND 
messages use IPv6 Hop-Limit field values of 255, and the ICMPv6 Echo Request 
and Echo Reply messages use a value of 128 or 64. 

The NS message is sent to the multicast address ff02::1:ff00:100b, which 
is the Solicited-Node multicast address corresponding to the IPv6 address being 
solicited (fe80::210:18ff:fe00:100b). We see that the soliciting node also 
includes its own link-layer address, 00:11:11:6f:c6:03, in a Source Link-Layer 
Address option. 

The NA response message is sent using link-layer (and IP-layer) unicast 
addressing back to the soliciting node. The Target Address field contains the value 
requested in the solicitation: fe80::210:18ff:fe00:100b. In addition, we see that 
the S and O flag fields are set, indicating that the advertisement is in response to 
the earlier solicitation, and that the information being provided should 
override any other information the soliciting node may have cached. The R flag 
field is unset, indicating that the responding host is not acting as a router. Finally, 
the solicited node includes the most important information in a Target Link-Layer 
Address option: the solicited node's link-layer address of 00:10:18:00:10:0b. 

8.5.3 ICMPv6 Inverse Neighbor Discovery Solicitation/Advertisement (ICMPv6 Types 141/142) 

The Inverse Neighbor Discovery (IND) facility in IPv6 [RFC3122] originated from a 
need to determine IPv6 addresses given link-layer addresses on Frame Relay 
networks. It resembles reverse ARP, a protocol once used with IPv4 networks primarily 
for supporting diskless computers. Its main function is to ascertain the 
network-layer address(es) corresponding to a known link-layer address. Figure 8-36 shows 
the basic format of IND Solicitation and Advertisement messages. 


 0                15 16               31 
 Type (141 or 142) | Code (0) | Checksum 
 Reserved (0) 
 Options 


Figure 8-36 The ICMPv6 IND Solicitation (type 141) and Advertisement (type 142) messages have 
the same basic format. They are used to map known link-layer addresses to IPv6 
addresses in environments where this is useful. 

The IND Solicitation message is sent to the All Nodes multicast address at 
the IPv6 layer but is encapsulated in a frame sent to a unicast link-layer address 
(the one being looked up). It must contain both a Source Link-Layer Address option 
and a Target Link-Layer Address option. It may also contain a Source/Target Address 
List option and/or an MTU option. 


8.5.4 Neighbor Unreachability Detection (NUD) 

One of the important features of ND is to detect when reachability between two 
systems on the same link has become lost or asymmetric (i.e., is not available in 
both directions). This is accomplished using the Neighbor Unreachability Detection 
(NUD) algorithm. It is used to manage the neighbor cache present on each node. 
The neighbor cache is analogous to the ARP cache described in Chapter 4; it is 
a (conceptual) data structure that holds the IPv6-to-link-layer-address mapping 
information required to perform direct delivery of IPv6 datagrams to on-link 
neighbors as well as information regarding the state of the mapping. Figure 8-37 
shows how it maintains entries in the neighbor cache. 



Figure 8-37 Neighbor Unreachability Detection helps maintain the neighbor cache consisting of 
several neighbor entries. Each entry is in one of five states at any given time. 
Confirmations of reachability are accomplished by receiving Neighbor Advertisement messages 
or using other higher-layer protocol information, if available. Unsolicited evidence 
includes unsolicited Neighbor and Router Advertisement messages. 






Each mapping may be in one of five states: INCOMPLETE, REACHABLE, 
STALE, DELAY, or PROBE. The transition diagram in Figure 8-37 shows the 
initial states to be either INCOMPLETE or STALE. When an IPv6 node has a unicast 
datagram to send to a destination, it checks its destination cache to see if an entry 
corresponding to the destination is present. If so, and the destination is on-link, 
the neighbor cache is consulted to see if the neighbor's state is REACHABLE. 
If so, the datagram is sent using direct delivery (see Chapter 5). If no neighbor 
cache entry is present but the destination appears to be on-link, NUD enters the 
INCOMPLETE state and sends an NS message. Successful receipt of a solicited NA 
message provides confirmation that the node is reachable, and the entry enters 
the REACHABLE state. The STALE state corresponds to apparently valid entries 
that have not yet been confirmed. This state is entered when either an entry has 
not been updated for some time when it was previously REACHABLE, or when 
unsolicited information is received (e.g., a node has changed its address and sent 
an unsolicited NA message). These cases suggest that reachability is possible, but 
confirmation in the form of a valid NA is still required. 

The other states, DELAY and PROBE, are temporary states. DELAY is used when 
a packet is sent but ND has no current evidence to suggest that reachability is 
possible. The state gives upper-layer protocols an opportunity to provide additional 
evidence. If after DELAY_FIRST_PROBE_TIME seconds (the constant 5) no evidence is 
received, the state changes to PROBE. In the PROBE state, ND sends periodic NS 
messages (every RetransTimer milliseconds, with constant default value 
RETRANS_TIMER equal to 1000). If no evidence has been received after sending 
MAX_UNICAST_SOLICIT NS messages (default 3), the entry is supposed to be deleted. 
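
The state transitions described above can be summarized as a small event-driven sketch. This is a simplified model, not a complete implementation: the class and method names are hypothetical, timers are represented as explicit events rather than real clocks, and the STALE-to-DELAY transition on packet transmission is assumed from the usual reading of [RFC4861].

```python
from enum import Enum

class State(Enum):
    INCOMPLETE = 1
    REACHABLE = 2
    STALE = 3
    DELAY = 4
    PROBE = 5

DELAY_FIRST_PROBE_TIME = 5   # seconds spent in DELAY before probing
RETRANS_TIMER = 1000         # milliseconds between NS probes
MAX_UNICAST_SOLICIT = 3      # NS probes sent before giving up

class NeighborEntry:
    """One neighbor cache entry; callers deliver timer expirations as events."""

    def __init__(self, state: State = State.INCOMPLETE):
        self.state = state
        self.probes_sent = 0
        self.deleted = False

    def on_solicited_na(self):             # solicited NA confirms reachability
        self.state, self.probes_sent = State.REACHABLE, 0

    def on_upper_layer_confirmation(self): # e.g., forward progress seen by TCP
        self.state, self.probes_sent = State.REACHABLE, 0

    def on_unsolicited_info(self):         # unsolicited NA or RA: plausible, unconfirmed
        self.state = State.STALE

    def on_reachable_timeout(self):        # entry not refreshed for some time
        if self.state is State.REACHABLE:
            self.state = State.STALE

    def on_packet_sent(self):              # traffic sent using an unconfirmed entry
        if self.state is State.STALE:
            self.state = State.DELAY

    def on_delay_timeout(self):            # DELAY_FIRST_PROBE_TIME seconds elapsed
        if self.state is State.DELAY:
            self.state, self.probes_sent = State.PROBE, 1  # first NS goes out

    def on_retrans_timeout(self):          # RetransTimer ms elapsed with no answer
        if self.state is State.PROBE:
            if self.probes_sent >= MAX_UNICAST_SOLICIT:
                self.deleted = True        # no evidence: drop the entry
            else:
                self.probes_sent += 1      # send another NS
```

Driving an entry through on_solicited_na, on_reachable_timeout, on_packet_sent, and on_delay_timeout walks it through REACHABLE, STALE, DELAY, and PROBE in turn; three further on_retrans_timeout events without a confirmation mark it deleted.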

8.5.5 Secure Neighbor Discovery (SEND) 

SEND [RFC3971] is a special set of enhancements aimed at providing additional 
security for ND messages. This is to help resist various spoofing attacks in which 
one host or router might masquerade as another (see Section 8.6, Chapter 18, and 
[RFC3756] for additional details). It specifically aims to protect against nodes 
masquerading as others when responding to NS messages. SEND does not use IPsec 
(see Chapter 18) but instead its own special mechanism. This mechanism is also 
used for securing FMIPv6 handoffs [RFC5269]. 

SEND operates in a framework with a set of assumptions. First, each 
SEND-capable router has a certificate, or cryptographic credential, that it can use to prove 
its identity to a host. Next, each host is also equipped with a trust anchor 
(configuration information enabling the credential to be verified). Finally, each node 
generates a public/private key pair when configuring the IPv6 addresses it will 
use. Details of credentials, trust anchors, key pairs, and other associated security 
techniques are given in Chapter 18. 

8.5.5.1 Cryptographically Generated Addresses (CGAs) 

Perhaps the most interesting feature of SEND is its use of an entirely different type of IPv6 
address called a cryptographically generated address (CGA) [RFC3972][RFC4581] 
[RFC4982]. This type of address is based on a node's public key information, thereby 
linking the address to the node's credential. Consequently, a node or address owner 
in possession of the corresponding private key is able to prove it is the authorized 
user of a particular CGA. CGAs also encode the subnet prefix with which they 
are associated so they cannot be moved trivially from one subnet to another. This 
approach is quite different from how addresses are typically assigned. 

An IPv6 CGA is generated by ORing a 64-bit subnet prefix with a specially 
constructed interface identifier. The CGA interface identifier is computed using 
a secure hash function (a hash function believed difficult to invert; see Chapter 18) 
called Hash1 with inputs derived from the node's public key and a special CGA 
parameters data structure. These parameters are also used as input to another 
secure hash function, Hash2, which provides a hash extension technique that 
effectively extends the number of bits of output for the hash function, increasing its 
security (i.e., strength against an adversary producing a different input resulting 
in the same hash value) [A03][RFC6273]. The CGA technique allows for the address 
owner's public key to be self-generated, so this approach can be used without an 
accompanying public key infrastructure (PKI) or other trusted third party. 

The CGA parameters data structure is shown in Figure 8-38. The Modifier field 
is initialized with a random value, and the Collision Count field is initialized to 0. 
The structure includes an Extension Fields area that can be adapted for future uses 
[RFC4581]. 


CGA Parameters: 
    Modifier (16 bytes) 
    Subnet Prefix (8 bytes) 
    Collision Count 
    Public Key (variable) 
    Extension Fields (optional, variable) 

Hash1: Top-most 64 bits of secure hash function over all field values, 
including the final Modifier value found when computing Hash2. 

Hash2: Top-most 112 bits of secure hash function over field values where the 
Subnet Prefix and Collision Count are zero. Hash2 is recomputed by incrementing 
the Modifier field until the first (16*Sec) bits are zero. 

CGA: Subnet Prefix (64 bits) | Sec (3 bits) | Hash1 value (59 bits), 
with u = 0 / g = 0 (2 bits) 

Figure 8-38 The SEND method for computing CGAs. The CGA parameters data structure is used as input to 
two cryptographic hash functions, Hash1 and Hash2. The Hash2 value must have (16*Sec) initial 
0 bits, where Sec is a 3-bit parameter. The Modifier is changed until Hash2 computes appropriately. 
The resulting values are used to compute Hash1, which is combined with Sec and the subnet 
prefix to produce the CGA. 







A 3-bit unsigned parameter called Sec influences how resistant the approach 
is to mathematical compromise, which secure hash function is used [RFC4982], 
and how computationally expensive the computations are (they are exponential 
in the Sec value). The IANA maintains a registry for Sec values [SI]. The Hash1 
and Hash2 functions operate on the same CGA parameter block in conjunction 
with the Sec value. The address owner begins by picking a random value for the 
Modifier field, treating the subnet prefix field as 0, and computing the Hash2 value. 
The result is required to have (16*Sec) initial 0 bits, so the input is modified by 
incrementing the modifier value by 1 and recomputing Hash2 until the condition 
is satisfied. This computation has time complexity O(2^(16*Sec)) and therefore becomes 
much more expensive as Sec increases. However, this computation is required 
only when the address is initially established. 

Once the proper modifier has been found, 59 bits of the Hash1 value are used 
in forming the low-order 59 bits of the interface identifier. The top 3 bits constitute 
the 3-bit Sec value, and bits 6-7 (from the left) contain two 0 bits (corresponding 
to the u and g address bits described in Chapter 2). If the address is found to be 
in conflict (e.g., using duplicate address detection, described in Chapter 6), the 
Collision Count field is incremented and Hash1 is recomputed. The collision count 
value is not permitted to grow beyond 2. Given that address collisions are unlikely 
to begin with, multiple such collisions should be considered evidence of a 
configuration error or attack. Once all the necessary calculations are complete, the 
complete CGA can be formed by concatenating the subnet prefix, Sec value, and Hash1 
value. Note that if the subnet prefix changes, only Hash1 needs to be recomputed 
as the modifier value can remain the same. (The reader interested in alternatives 
to CGAs should consult [RFC5535], which describes hash-based addresses, or HBAs. 
HBAs are used for multihomed hosts using multiple prefixes in a somewhat 
different context and with a different form of cryptography that is less 
computationally expensive, although HBA-CGA-compatible options have also been defined.) 
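
The generation procedure can be sketched as follows. This is a minimal illustration under several assumptions: SHA-1 is used for both hashes (as for the currently registered Sec values), the public key is treated as an opaque byte string rather than the encoded key structure a real implementation would use, the fields are concatenated in the order Modifier, Subnet Prefix, Collision Count, Public Key, Extension Fields, and the function name generate_cga is our own.

```python
import hashlib
import ipaddress
import os
from typing import Optional, Tuple

IID_MASK = 0x1CFFFFFFFFFFFFFF  # clears the 3 Sec bits and the u/g bits (bits 6-7)

def generate_cga(prefix: str, public_key: bytes, sec: int,
                 modifier: Optional[bytes] = None, collision_count: int = 0,
                 ext: bytes = b"") -> Tuple[ipaddress.IPv6Address, bytes]:
    """Return (CGA, final modifier) for a /64 prefix."""
    if not 0 <= sec <= 7:
        raise ValueError("Sec is a 3-bit value")
    if collision_count not in (0, 1, 2):
        raise ValueError("collision count may not exceed 2")
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("a 64-bit subnet prefix is required")
    prefix_bytes = net.network_address.packed[:8]
    mod = int.from_bytes(os.urandom(16) if modifier is None else modifier, "big")

    # Hash2: prefix and collision count treated as zero (9 zero bytes).
    # Increment the modifier until the top 16*Sec bits of the 112-bit value are 0.
    while True:
        mod_bytes = mod.to_bytes(16, "big")
        digest = hashlib.sha1(mod_bytes + bytes(9) + public_key + ext).digest()
        hash2 = int.from_bytes(digest[:14], "big")
        if hash2 >> (112 - 16 * sec) == 0:
            break
        mod = (mod + 1) % (1 << 128)

    # Hash1: top 64 bits over the real field values with the final modifier.
    digest = hashlib.sha1(mod_bytes + prefix_bytes + bytes([collision_count])
                          + public_key + ext).digest()
    hash1 = int.from_bytes(digest[:8], "big")

    # Interface identifier: Sec in the top 3 bits, u = g = 0, rest from Hash1.
    iid = (hash1 & IID_MASK) | (sec << 61)
    return ipaddress.IPv6Address(prefix_bytes + iid.to_bytes(8, "big")), mod_bytes
```

With Sec equal to 0 the loop exits on its first pass; with Sec equal to 1 it needs about 2^16 hash computations on average, which illustrates why the cost grows so quickly with Sec. If the subnet prefix changes, calling the function again with the returned modifier skips the search, since Hash2 does not depend on the prefix.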

At this point we have seen how a CGA is generated but not how it is used for 
security. Note that anyone can generate a CGA given a subnet prefix, Sec value, 
and their own (or someone else's) public key. To ensure that a CGA is well formed 
and is using an appropriate subnet prefix, it must be verified, a process called CGA 
verification. A verifier requires knowledge of the CGA and CGA parameters. The 
verification process involves ensuring all of the following: the collision count is 
not greater than 2, the CGA's subnet prefix matches that in the CGA parameters, 
Hash1 computed on the CGA parameters matches the interface identifier portion 
of the CGA (where the first 3 bits and bits 6 and 7 are "don't cares"), and the value 
of Hash2 computed on the CGA parameters with the Subnet Prefix and Collision 
Count fields set to 0 has (16*Sec) initial 0 bits. If all of these checks are successful, 
the CGA is a legitimate one for the corresponding subnet prefix. This computation 
involves at most two hash functions; it is far simpler than the address generation 
process. 
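
The four checks can be sketched directly. As with any such illustration, this one rests on assumptions: SHA-1 for both hashes, an opaque byte string standing in for the encoded public key, the field order Modifier, Subnet Prefix, Collision Count, Public Key, Extension Fields, and hypothetical function names.

```python
import hashlib
import ipaddress

IID_MASK = 0x1CFFFFFFFFFFFFFF  # ignore the Sec bits and bits 6-7 ("don't cares")

def cga_hashes(modifier: bytes, prefix: bytes, collision_count: int,
               public_key: bytes, ext: bytes = b""):
    """Return (Hash1, Hash2) as integers of 64 and 112 bits, respectively."""
    h1 = hashlib.sha1(modifier + prefix + bytes([collision_count])
                      + public_key + ext).digest()[:8]
    h2 = hashlib.sha1(modifier + bytes(9) + public_key + ext).digest()[:14]
    return int.from_bytes(h1, "big"), int.from_bytes(h2, "big")

def verify_cga(address, modifier: bytes, prefix: bytes, collision_count: int,
               public_key: bytes, ext: bytes = b"") -> bool:
    """Check that address is a well-formed CGA for the given CGA parameters."""
    if collision_count not in (0, 1, 2):            # check 1: collision count
        return False
    if len(modifier) != 16 or len(prefix) != 8:
        return False
    packed = ipaddress.IPv6Address(address).packed
    if packed[:8] != prefix:                        # check 2: subnet prefix
        return False
    iid = int.from_bytes(packed[8:], "big")
    hash1, hash2 = cga_hashes(modifier, prefix, collision_count, public_key, ext)
    if (iid & IID_MASK) != (hash1 & IID_MASK):      # check 3: Hash1 vs. interface ID
        return False
    sec = iid >> 61                                 # Sec is carried in the address
    return hash2 >> (112 - 16 * sec) == 0           # check 4: Hash2 leading zeros
```

Note that the verifier reads Sec from the address itself, so an attacker cannot lower the required number of leading Hash2 zero bits without also changing the address being claimed.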

To verify that a CGA is being used by its authorized address owner, a process called 
signature verification, the owner forms a typed message and attaches a CGA signature 
that can be computed only with knowledge of the private key corresponding to 
the public key used with the CGA. A verifier forms a data block by concatenating 
a special 128-bit type tag with the message. The CGA ownership is verified using 
an RSA signature (RSASSA-PKCS1-v1_5 [RFC3447]) with the public key (extracted 
from the CGA parameters), data block, and signature as parameters. Generally, a 
CGA and its user are considered valid only if both the CGA verification and 
signature verification processes have completed successfully. 

The handling of CGAs and verification is accomplished using two ICMPv6 
messages and six options defined in [RFC3971]. The RFC also defines two 
IANA-managed registries for holding Name Type fields in the Trust Anchor option and 
the Cert Type field in the Certificate option (see Section 8.5.6.13). [RFC3972] defines 
the CGA Message Type registry, with the 128-bit value 
0x086FCA5E10B200C99C8CE00164277C08 given in [RFC3971] (other values are defined 
for uses other than SEND). A registry for Sec values is defined by [RFC4982] but at 
present provides only for values 0, 1, and 2, which correspond to use of the SHA-1 
secure hash function using 0, 16, or 32 initial 0 bits for the Hash2 function, 
respectively. An extension format defined in [RFC4581] supports TLV encodings that 
can be used for future standard extensions, but only one has been defined to date 
[RFC5535]. We will now describe the two ICMPv6 messages used with SEND and defer 
discussion of the options until we cover all of the ICMPv6 options in the next section. 

8.5.5.2 Certification Path Solicitation/Advertisement (ICMPv6 Types 148/149) 

SEND defines Solicitation and Advertisement messages to help hosts determine 
certificates constituting a certification path. This is used for a host to verify the 
authenticity of router advertisements. Figure 8-39 shows the Solicitation message. 


 0                15 16               31 
 Type (148) | Code (0) | Checksum 
 Identifier | Component 
 Options 


Figure 8-39 The Certification Path Solicitation message. The sender requests a particular 
certificate by position index, provided as the value of the Component field. The value 65535 
indicates that all certificates in the path rooted at the identity given within an attached 
Trust Anchor option are desired. 


The Certification Path Solicitation message contains a random Identifier field 
used for matching solicitations with advertisements. The value of the Component 
field provides an index to the point in the certification path in which the requestor 
is interested. This value is set to all 1s (value 65535) if certificates for the entire 
path are desired. The messages may contain a Trust Anchor option (see Section 
8.5.6.12). Certificates and certification paths are described in more detail in 
Chapter 18. 

The Certification Path Advertisement message, shown in Figure 8-40, 
provides a method to express one component (certificate) in a multicomponent 
advertisement. These messages are sent in response to a solicitation, or periodically by a 
SEND-capable router. When sent in response to a solicitation, the destination IPv6 
address is the Solicited-Node multicast address of the receiver. 


 0                15 16               31 
 Type (149) | Code (0) | Checksum 
 Identifier | All Components 
 Component | Reserved (0) 
 Options 


Figure 8-40 The Certification Path Advertisement message. The sender requests a particular 
certificate by position index, provided as the value of the Component field. The value 65535 
indicates all certificates in the path rooted at an identity given within an attached Trust 
Anchor option. 


The Identifier field holds the value received in a corresponding Solicitation 
message. It is set to 0 for unsolicited Advertisement messages that are sent to the 
All Nodes multicast address. The All Components field indicates the total number 
of components in the entire certification path, including the trust anchor. Note 
that a single advertisement message is recommended to avoid fragmentation, so 
such messages contain only a single component. The Component field gives the 
index in the certification path of the associated certificate (provided in an attached 
Certificate option). The recommended order for sending advertisements for an 
N-component certification path is (N - 1, N - 2, ..., 0). Component N need not be 
sent as it is already present from the trust anchor. 

8.5.6 ICMPv6 Neighbor Discovery (ND) Options 

As with many of the protocols of the IPv6 family, a set of standard protocol 
headers are defined, and one or more options may also be included. ND messages 
may contain zero or more options, and some options can occur more than once. 
However, with certain messages some of the options are mandatory. The general 
format for ND options is given in Figure 8-41. 

Figure 8-41 ND options are variable-length and begin with a common TLV arrangement. The 
Length field gives the total length of the option in 8-byte units (including the Type and 
Length fields). 


All ND options start with an 8-bit Type and an 8-bit Length field, supporting 
options of variable length, up to 2040 bytes (255 units of 8 bytes). Options are padded 
to 8-byte boundaries, and the Length field gives the total length of the option in 
8-byte units. The Type and Length fields are included in the value of the Length field, 
which has a minimum value of 1. Table 8-5 gives a list of 25 standard options that 
have been defined as of mid-2011 (plus the experimental values). The official list 
may be found in [ICMP6TYPES]. 
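
Because every option shares this Type/Length prefix, a receiver can walk the options area without understanding each option. The following sketch (the function name is our own) splits an options area into (type, value) pairs and rejects a Length of 0, which can never be valid given the minimum value of 1:

```python
from typing import List, Tuple

def parse_nd_options(data: bytes) -> List[Tuple[int, bytes]]:
    """Split an ND options area into (type, value) pairs.

    The value excludes the 2-byte Type/Length prefix but keeps any padding.
    """
    options = []
    offset = 0
    while offset < len(data):
        if len(data) - offset < 2:
            raise ValueError("truncated option header")
        opt_type = data[offset]
        length_units = data[offset + 1]      # length in units of 8 bytes
        if length_units == 0:
            raise ValueError("option with zero length")
        total = 8 * length_units             # includes Type and Length fields
        if offset + total > len(data):
            raise ValueError("option runs past end of message")
        options.append((opt_type, data[offset + 2:offset + total]))
        offset += total
    return options

# A Source Link-Layer Address option carrying 00:11:11:6f:c6:03
print(parse_nd_options(bytes.fromhex("010100" "11116fc603")))
# prints [(1, b'\x00\x11\x11o\xc6\x03')]
```

An Ethernet address fits exactly: 2 bytes of Type and Length plus 6 bytes of address give a single 8-byte unit, so the Length field is 1.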


Table 8-5 IPv6 ND option types, defining reference, use, and description 

Type      Name                         Reference             Use/Comment 
1         Source Link-Layer Address    [RFC4861]             Sender's link-layer address; used with NS, RS, and RA messages 
2         Target Link-Layer Address    [RFC4861]             Target's link-layer address; used with NA and Redirect messages 
3         Prefix Information           [RFC4861] [RFC6275]   An IPv6 prefix or address; used with RA messages 
4         Redirected Header            [RFC4861]             Portion of original IPv6 datagram; used with Redirect messages 
5         MTU                          [RFC4861]             Recommended MTU; used with RA messages, IND Advertisement messages 
6         NBMA Shortcut Limit          [RFC2491]             Hop limit for "shortcut attempt"; used with NS messages 
7         Advertisement Interval       [RFC6275]             Sending interval of unsolicited RA messages; used with RA messages 
8         Home Agent Information       [RFC6275]             Preference and lifetime to be an MIPv6 HA; used with RA messages (H bit on) 
9         Source Address List          [RFC3122]             Host's addresses; used with IND messages 
10        Target Address List          [RFC3122]             Target addresses; used with IND messages 
11        CGA                          [RFC3971]             Crypto-based address; used with secure Neighbor Discovery (SEND) messages 
12        RSA Signature                [RFC3971]             Credential for host signature (SEND) 
13        Timestamp                    [RFC3971]             Anti-replay timestamp (SEND) 
14        Nonce                        [RFC3971]             Anti-replay random number (SEND) 
15        Trust Anchor                 [RFC3971]             Indicates credential type (SEND) 
16        Certificate                  [RFC3971]             Encodes a certificate (SEND) 
17        IP Address/Prefix            [RFC5568]             Care-of or NAR addresses; used with FMIPv6 PrRtAdv messages 
19        Link-Layer Address           [RFC5568]             Desired next access point or mobile node's address; used with FMIPv6 RtSolPr or PrRtAdv messages 
20        Neighbor Advertisement ACK   [RFC5568]             Tells mobile about next valid CoA; used with RA messages 
24        Route Information            [RFC4191]             Route prefix/preferred router list 
25        Recursive DNS Server         [RFC6106]             IP address of DNS server; added to RA messages 
26        RA Flags Extension           [RFC5175]             Expands space for RA flags 
27        Handover Key Request         [RFC5269]             FMIPv6: request key using SEND 
28        Handover Key Reply           [RFC5269]             FMIPv6: key reply using SEND 
31        DNS Search List              [RFC6106]             DNS domain search names; added to RA messages 
253, 254  Experimental                 [RFC4727]             [RFC3692]-style experiments 1/2 


8.5.6.1 Source/Target Link-Layer Address Option (Types 1, 2) 

The Source Link-Layer Address option (type 1; see Figure 8-42) is supposed to be 
included in ICMPv6 RS messages, NS messages, and RA messages whenever used 
on a network supporting link-layer addressing. It specifies a link-layer address 
associated with the message. More than one of these options may be included for 
nodes with more than one address. 


 0                15 16               31 
 Type (1 or 2) | Length | Link-Layer Address (variable) 


Figure 8-42 The Source (type 1) and Target (type 2) Link-Layer Address options. The Length field 
gives the length of the entire option, including the address, in units of 8 bytes (e.g., an 
IEEE Ethernet-type address would have the value of 1 in the Length field). 

The Target Link-Layer Address option (type 2), which uses a similar format, 
must be provided in an NA message when responding to multicast solicitations. 
This option is also typically included in Redirect messages (discussed previously) 
and must be included in such messages when operating on an NBMA network. 

8.5.6.2 Prefix Information Option (Type 3) 

The Prefix Information option (PIO), provided on RA messages and Mobile Prefix 
Advertisement messages, indicates the IPv6 address prefixes and (in some cases) 
complete IPv6 addresses of individual nodes present on the link (see Figure 8-43). 
In cases where multiple prefixes or addresses are reported, multiple copies of this 
option may be included in a single message. A router is supposed to include a PIO 
for each prefix it uses. An R bit field set to 1 indicates that the Prefix field contains 
the entire global IPv6 address of the sending router, rather than just its prefix with 
the remaining bits of the prefix field being 0 or its link-local address (present in the 
Source IP Address field of the containing IPv6 datagram). This is useful for Mobile 
IPv6 home agent discovery, and home agents sending router advertisements must 
include this option with the R bit field set for at least one prefix. 



Figure 8-43 The Prefix Information option contains an IPv6 address prefix in use on the local 
network. It is used to provide hosts with prefixes for address autoconfiguration if the A 
bit field is set. The L bit field indicates that the prefix is acceptable for use in on-link 
determination. The R bit field is used to indicate that the included prefix is the entire 
global IPv6 address of the sending router. 


The Prefix Length field gives the number of bits (up to 128) in the Prefix field 
that should be considered valid for use in configuration. The L bit field is the 
"on-link" flag and indicates that the provided prefix is eligible to be used for 
on-link determination (see the next paragraph). If it is not set, it makes no 
statement one way or another about its use in on-link determination. The A bit field is 
the "autonomous autoconfiguration" flag and indicates that the provided prefix 
may be used for autoconfiguration (see Chapter 6). The Valid Lifetime and Preferred 
Lifetime fields indicate the number of seconds in which the prefix can be used for 
on-link determination and automatic address autoconfiguration, respectively. A 
value of 0xFFFFFFFF for either field indicates infinity. 
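
The fields just described can be decoded with a short sketch. The byte layout used here is an assumption taken from the published option format in [RFC4861] and [RFC6275] (Type, Length, Prefix Length, a flags byte with L, A, and R as its three high-order bits, two 32-bit lifetimes, 32 reserved bits, and a 128-bit Prefix field); the function name and dictionary keys are our own.

```python
import ipaddress
import struct

INFINITE_LIFETIME = 0xFFFFFFFF

def parse_prefix_information(option: bytes) -> dict:
    """Decode a 32-byte Prefix Information option (type 3, Length 4)."""
    if len(option) != 32:
        raise ValueError("Prefix Information option must be 32 bytes")
    opt_type, length, prefix_len, flags, valid, preferred, _reserved = \
        struct.unpack("!BBBBIII", option[:16])
    if opt_type != 3 or length != 4:
        raise ValueError("not a Prefix Information option")
    if prefix_len > 128:
        raise ValueError("invalid prefix length")

    def lifetime(seconds):
        # None stands for "infinity" (all 1 bits in the field)
        return None if seconds == INFINITE_LIFETIME else seconds

    return {
        "prefix": ipaddress.IPv6Address(option[16:32]),
        "prefix_length": prefix_len,
        "on_link": bool(flags & 0x80),         # L flag
        "autonomous": bool(flags & 0x40),      # A flag
        "router_address": bool(flags & 0x20),  # R flag
        "valid_lifetime": lifetime(valid),
        "preferred_lifetime": lifetime(preferred),
    }
```

Returning None for an infinite lifetime keeps callers from accidentally treating 0xFFFFFFFF as an ordinary number of seconds.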

In IPv6, nodes that are "on-link" correspond to those that can be reached using 
direct delivery (Chapter 5). In IPv4, nodes are assumed to be on-link if they share 
a common prefix, determined using a combination of their own IPv4 address and 
assigned subnet mask. Although this arrangement can be achieved using IPv6, it 
is not necessary, and on-link status is not assumed without confirmation. Instead, 
the L bit field indicates to a host or router which prefixes or list of individual hosts 
is present on-link [RFC5942]. Other mechanisms can also serve this purpose (e.g., 
DHCPv6, manual configuration, or ICMPv6 Redirect messages). A node is 
considered off-link unless there is confirming information to indicate that it is on-link. 

8.5.6.3 Redirected Header Option (Type 4) 

The Redirected Header option is used to include a copy of (or part of) the original 
("offending") IPv6 datagram that caused a Redirect message to be generated. The 
option format is given in Figure 8-44. The option is ignored if it appears in any 
other type of message. 


 0                15 16               31 
 Type (4) | Length | Reserved (0) 
 Reserved (0) 
 As much of offending datagram as possible so that the resulting 
 IPv6/ICMPv6 datagram does not exceed the minimum MTU 


Figure 8-44 The Redirected Header option marks the beginning of a partial (or complete) copy of 
the offending IPv6 datagram. In any case, the message is limited to at most the 
minimum IPv6 MTU (currently 1280 bytes). 


8.5.6.4 MTU Option (Type 5) 

The MTU option is provided on RA messages and ignored otherwise (see Figure 
8-45). It provides the MTU to be used by hosts, assuming that a configurable MTU 
size is supported. 



Figure 8-45 The MTU option includes the MTU to be used on the local link. This option is used 
with RA messages and is most useful if a nonstandard or unknown MTU is to be used. 

The MTU option is important, for example, when bridging two or more heterogeneous
link-layer technologies that have different MTUs. Without this option
(and assuming bridges do not generate ICMPv6 PTB messages), hosts may not be
able to communicate reliably with other hosts on the bridged link-layer network.
Note that this option reserves 32 bits to hold the MTU, supporting very large
MTUs.

8.5.6.5 Advertisement Interval Option (Type 7) 

This option may be included in RA messages and is ignored otherwise. It specifies
the maximum interval between unsolicited multicast router advertisements (see
Figure 8-46).



Figure 8-46 The Advertisement Interval gives the number of milliseconds between unsolicited 
multicast Router Advertisement messages. 

The Advertisement Interval option gives the time between periodic router
advertisement messages. The Advertisement Interval field defines the maximum
number of milliseconds between transmissions of RA messages sent by the
sender of this message on the arriving network. The sending router may send
advertisements more frequently than the option indicates, but not less frequently.
This option is used by Mobile IPv6 nodes in their movement detection algorithms
[RFC6275].

8.5.6.6 Home Agent Information Option (Type 8) 

This option may be included in RA messages being sent from routers willing to
act as Mobile IPv6 home agents [RFC6275] (i.e., those that set the H bit field in their
RA messages) and is ignored otherwise. The option is not allowed to be included
if the H bit field is not set. In cases where solicited RA messages are used such that
multiple addresses are carried in separate messages and the R bit field is set, this
option must be included with each of them and each must contain the same value.
Figure 8-47 shows the Home Agent Information option format.


0                15 16               31
Type (8) | Length (1) | Reserved (0)
Home Agent Preference | Home Agent Lifetime (seconds)


Figure 8-47 The Home Agent Information option indicates the preference and amount of time in 
which the sender of the option is willing to be considered a home agent for Mobile IPv6. 
Larger values of the Home Agent Preference field indicate a more desirable home agent. 
The Home Agent Lifetime field gives the number of seconds during which the sender is 
willing to be an HA. 


The Home Agent Preference field is a 16-bit unsigned integer used to help a
mobile node order the addresses provided to it via Home Agent Address Discovery
Reply messages. Larger values indicate a greater degree of preference for
using the sending router as a home agent. If this option is not included in a Router
Advertisement message where the H bit field (home agent) is set, the preference
value of the originating router must be considered to be 0 (lowest preference).

The Home Agent Lifetime field, also a 16-bit unsigned integer, specifies the
number of seconds in which the sender of the message should be considered
eligible to act as a home agent (with the corresponding preference described
previously). The default value of this field is equal to the Lifetime field of the containing
RA message. The maximum value of this field (65,535) corresponds to 18.2 hours,
and the minimum value is 1 (0 is not allowed). If both the Home Agent Lifetime and
the Home Agent Preference fields contain only default values, the entire option is not
supposed to be included in the RA message.

8.5.6.7 Source/Target Address List Options (Types 9, 10) 

These options may be included with an IND message [RFC3122]. The format is
given in Figure 8-48. The Source Address List option (type 9) contains a list of the
IPv6 addresses identified by the Source Link-Layer Address option. The Target
Address List option (type 10) contains a list of the IPv6 addresses identified by the
Destination Link-Layer Address option. The number of addresses included in the
option is equal to (Length - 1)/2, where the Length field value contains the size of
the option in 8-byte units.
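
As a minimal sketch of the address-count arithmetic above, the following Python
helper derives the number of addresses from the Length field. The function name is
illustrative, and the validity check assumes that the option consists of one 8-byte
unit of header followed by whole 16-byte IPv6 addresses, which is what the formula
implies.

```python
def address_count(length_field: int) -> int:
    """Return the number of IPv6 addresses in an Address List option.

    length_field is the option's Length value, expressed in 8-byte units.
    One unit is taken up by the option header (an assumption implied by
    the formula); each 16-byte IPv6 address occupies two further units.
    """
    if length_field < 1 or (length_field - 1) % 2 != 0:
        raise ValueError("Length must be odd and at least 1")
    return (length_field - 1) // 2


# A 24-byte option (Length = 3) carries a single address.
print(address_count(3))
# A 56-byte option (Length = 7) carries three addresses.
print(address_count(7))
```

An even Length value cannot describe a whole number of addresses, so the sketch
rejects it rather than rounding.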

Figure 8-48 The Source (type 9) and Target (type 10) Address List options. These are used in
supporting IND and provide a list of a node's IPv6 addresses. Only the addresses used on
the interface used to send the message should be included.


8.5.6.8 CGA Option (Type 11) 

The CGA option is used with SEND [RFC3971] to carry the CGA parameters
necessary for a verifier to perform CGA validation and signature validation. Its
format is given in Figure 8-49.

The CGA Parameters area is composed of the same fields depicted in Figure
8-38. See [RFC3971] for more details.

0                15 16               31
Type (11) | Length | Pad Length | Reserved (0)
CGA Parameters (variable)
Padding (0) (variable)


Figure 8-49 The CGA option used with SEND. The option encodes the CGA parameters shown in 
Figure 8-38. 


8.5.6.9 RSA Signature Option (Type 12) 

The RSA Signature option is used with SEND [RFC3971] to carry an RSA signature
(see Chapter 18) that a verifier can use, in conjunction with CGA parameters,
to determine if a sending system has possession of the private key associated with
a CGA's public key. Its format is given in Figure 8-50.


0                15 16               31
Type (12) | Length | Reserved (0)
Key Hash (128 bits)
Digital Signature (variable)
Padding (0) (variable)

(The first row and the Key Hash field together occupy 20 bytes.)


Figure 8-50 The RSA Signature option used with SEND. The signature is encoded in the PKCS#1
v1.5 (see Chapter 18) format and is used to verify that the sender possesses the matching
private key and is consequently the correct owner of the CGA.

The Key Hash field contains the high-order 128 bits of a SHA-1 hash of the
public key used in constructing the signature. The Digital Signature field contains
a standardized signature over the following values: the CGA Message Type tag
for SEND, the source IP and destination IP addresses, the first 32-bit word of the
ICMPv6 header (Type, Code, and Checksum fields), and the ND protocol message
header and options (not including the RSA signature option).
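
The Key Hash computation can be sketched in a few lines of Python. The function
name and the sample input are hypothetical; in particular, the exact byte encoding
of the public key that is fed to SHA-1 is defined by [RFC3971] and is simply
assumed here to be available as a byte string.

```python
import hashlib


def send_key_hash(public_key_bytes: bytes) -> bytes:
    """Return the high-order 128 bits of the SHA-1 hash of a public key.

    public_key_bytes is assumed to already hold the key in the encoding
    that SEND requires; this sketch does not perform that encoding.
    """
    digest = hashlib.sha1(public_key_bytes).digest()  # 160 bits (20 bytes)
    return digest[:16]                                # leftmost 128 bits


# Placeholder bytes standing in for an encoded public key.
key_hash = send_key_hash(b"placeholder-public-key")
print(len(key_hash), key_hash.hex())
```

SHA-1 produces 20 bytes, so the field keeps the first 16 and discards the last 4.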

8.5.6.10 Timestamp Option (Type 13) 

The Timestamp option gives the current time of day known to the sending system.
This helps counter potential replay attacks against SEND [RFC3971]. Its format is
given in Figure 8-51.


0                15 16               31
Type (13) | Length (2) | Reserved (48 bits)
Timestamp (64 bits; seconds since January 1, 1970)


Figure 8-51 The Timestamp option used with SEND. The value encodes the number of seconds that
have elapsed since January 1, 1970. It is used to guard against replay attacks.


The Timestamp field contains the number of seconds since January 1, 1970, 
00:00 UTC. The format is fixed-point. The high-order 48 bits encode the number 
of complete seconds. The remaining bits indicate the number of (1/64K) fractions 
of a second. 
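
The fixed-point layout just described can be illustrated with a short Python
sketch. The function names are illustrative only; the code assumes the 64-bit
field is carried in network (big-endian) byte order, with the whole seconds in the
upper 48 bits and the count of 1/65,536-second units in the lower 16 bits.

```python
def encode_timestamp(whole_seconds: int, fraction_units: int = 0) -> bytes:
    """Pack seconds and 1/65536-second units into a 64-bit Timestamp field."""
    if not 0 <= whole_seconds < (1 << 48):
        raise ValueError("seconds must fit in 48 bits")
    if not 0 <= fraction_units < (1 << 16):
        raise ValueError("fraction must fit in 16 bits")
    value = (whole_seconds << 16) | fraction_units
    return value.to_bytes(8, "big")


def decode_timestamp(field: bytes) -> tuple:
    """Split a 64-bit Timestamp field into (whole seconds, fraction units)."""
    if len(field) != 8:
        raise ValueError("Timestamp field must be 8 bytes")
    value = int.from_bytes(field, "big")
    return value >> 16, value & 0xFFFF


# One and a half seconds after the epoch: 32768/65536 = 0.5 second.
field = encode_timestamp(1, 32768)
print(field.hex(), decode_timestamp(field))
```

Keeping the two parts as integers avoids floating-point rounding when a value is
encoded and decoded again.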

8.5.6.11 Nonce Option (Type 14) 

The Nonce option holds a recently generated random number. This helps counter
potential replay attacks against SEND [RFC3971]. Its format is given in Figure 8-52.


0                15 16               31
Type (14) | Length | Nonce Value (variable, at least 6 bytes)


Figure 8-52 The Nonce option used with SEND. The value encodes a random number used in pairs 
of SEND messages. It is used to guard against replay attacks. 

The nonce value is a random number selected by the sender. The length of the
number must be at least 6 bytes. Details on using nonces to combat replay attacks
are given in Chapter 18.

8.5.6.12 Trust Anchor Option (Type 15) 

The Trust Anchor option includes the name (root) of a certification path (see
Chapter 18). This is used with SEND for a host to verify the authenticity of RA
messages. Its format is given in Figure 8-53.


0                15 16               31
Type (15) | Length | Name Type | Pad Length
Name (variable)
Padding (0) (variable)


Figure 8-53 The Trust Anchor option used with SEND. The trust anchor is the name of the root of 
a certificate chain. Subordinate certificates may be validated against the trust anchor. 
Certificate chains are used in SEND for a host to validate router advertisements. 


The Name Type field indicates the type of name used. Currently, two values
have been defined: 1, DER X.501 names; 2, fully qualified domain name (FQDN).
More than one Trust Anchor option may be included. The Name field gives the
name of the trust anchor in the format specified by the Name Type field. The trust
anchor is the root of trust for a certificate chain that the sender of the message is
willing to accept (see Chapter 18).

8.5.6.13 Certificate Option (Type 16) 

The Certificate option holds a single certificate used with SEND [RFC3971] in
providing a certification path. Its format is given in Figure 8-54.

The Cert Type field indicates the type of certificate used. Currently, one value 
has been defined: 1, X.509v3 certificate. Certificates and how they are managed 
are discussed in more detail in Chapter 18. 

8.5.6.14 IP Address/Prefix Option (Type 17)

The IP Address/Prefix option is used with FMIPv6 messages (ICMPv6 type 154)
[RFC5568]. Its format is given in Figure 8-55.

0                15 16               31
Type (16) | Length | Cert Type | Reserved
Certificate (variable)
Padding (0) (variable)

Figure 8-54 The Certificate option used with SEND. The option holds a cryptographic
certificate comprising one component of a certification path. This is used to validate router
advertisements.

0                15 16               31
Type (17) | Length (3) | Option-Code | Prefix Length
Reserved (0)
IPv6 Address (or Prefix)

(The complete option occupies 24 bytes.)

Figure 8-55 The IP Address/Prefix option used with FMIPv6. The option holds a prefix or IPv6
address of the next access router or care-of address used by a mobile node.

The Option-Code field value indicates which type of address is encoded: 1,
old care-of address; 2, new care-of address; 3, new access router's (NAR's) IPv6
address; 4, NAR's prefix (in PrRtAdv). The Prefix Length field gives the number of
valid leading bits in the IPv6 Address field. The IPv6 Address field encodes the IPv6
address identified in the Option-Code field.

8.5.6.15 Link-Layer Address (LLA) Option (Type 19) 

The Link-Layer Address (LLA) option is used with FMIPv6 messages (ICMPv6
type 154) [RFC5568]. Its format is given in Figure 8-56.

0                15 16               31
Type (19) | Length | Option-Code | Link-Layer Address (variable)

Figure 8-56 The Link-Layer Address option used with FMIPv6. The option-code value indicates
what entity is associated with the address (i.e., any AP, particular AP, NAR, sender
of RtSolPr or PrRtAdv message, router), if prefix information is available, and if fast
handovers are supported by the AP indicated in the LLA.

The Option-Code field value indicates how the associated Link-Layer Address
field value is to be interpreted: 0, wildcard—resolution requested for all nearby
APs; 1, address of the new AP; 2, address of the mobile node; 3, address of the
new access router; 4, address of the source of the RtSolPr/PrRtAdv message; 5,
address is current for the router; 6, no prefix information available for the AP
corresponding to the address; 7, no fast handovers available for the AP addressed.
The Link-Layer Address field contains the address identified by the Option-Code
field.

8.5.6.16 Neighbor Advertisement ACK (NAACK) Option (Type 20) 

This option is used with FMIPv6 messages (ICMPv6 type 154) [RFC5568]. Its
format is given in Figure 8-57.


0                15 16               31
Type (20) | Length (1 or 3) | Option-Code | Status
Reserved (0)
NCoA (New Care-of Address, If Provided)


Figure 8-57 The Neighbor Advertisement Acknowledgment option used with FMIPv6. When a 
mobile node moves from a previous access router to a new access router and proposes 
to use a particular new care-of address, the new router indicates the acceptability of the 
proposed address. 


The Option-Code value is 0. The Status field indicates the disposition of the
unsolicited neighbor advertisement. The following values are defined: 1, new
care-of address (NCoA) is invalid (perform address configuration); 2, NCoA is
invalid (use NCoA supplied in IP Address option); 3, NCoA is invalid (use NAR's
address as NCoA); 4, previous care-of address (PCoA) supplied (do not send
binding update); 128, link-layer address unrecognized.

8.5.6.17 Route Information Option (Type 24)

This option is used with RA messages to indicate which off-link prefixes are
reachable through a particular router [RFC4191]. Its format is given in Figure 8-58.


0                15 16               31
Type (24) | Length | Prefix Length | Resv (0) (3 bits) | Pref (2 bits) | Resv (0) (3 bits)
Route Lifetime (seconds)
Prefix (variable)





Figure 8-58 The Route Information option indicates the preference for using a particular router to
reach a particular off-link prefix. It is most useful in cases where multiple default
routers are available and perform differently in reaching the same destinations.


The Prefix Length field gives the number of valid leading bits in the Prefix field.
The Pref field indicates whether the router associated with the included prefix
should be preferred over others. If this field contains the value 2, the option must
be ignored. The Route Lifetime field gives the number of seconds for which the
prefix is to be considered valid. The value of all 1s indicates infinity. The
variable-length Prefix field gives the IPv6 prefix being described.
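
A hedged sketch of how a receiver might decode this option is shown below. The
function and dictionary key names are illustrative. The byte widths are
assumptions taken from [RFC4191] rather than from the text above: 8 bits each for
Type, Length, and Prefix Length, one byte holding the Resv/Pref/Resv bits, and a
32-bit Route Lifetime, followed by 0, 8, or 16 bytes of prefix.

```python
import ipaddress
import struct

ROUTE_INFO_TYPE = 24
INFINITE_LIFETIME = 0xFFFFFFFF  # all 1s


def parse_route_info(option: bytes):
    """Decode a Route Information option; return None if it must be ignored."""
    if not 8 <= len(option) <= 24:
        raise ValueError("unexpected option size")
    opt_type, length, prefix_len, flags, lifetime = struct.unpack(
        "!BBBBI", option[:8])
    if opt_type != ROUTE_INFO_TYPE:
        raise ValueError("not a Route Information option")
    if length * 8 != len(option) or prefix_len > 128:
        raise ValueError("inconsistent Length or Prefix Length")
    pref = (flags >> 3) & 0x3  # 2-bit Pref field between the reserved bits
    if pref == 2:
        return None  # per the text, an option carrying this value is ignored
    # The Prefix field may be shorter than 16 bytes; pad it with zeros.
    prefix_int = int.from_bytes(option[8:].ljust(16, b"\x00"), "big")
    network = ipaddress.IPv6Network((prefix_int, prefix_len), strict=False)
    return {
        "prefix": network,
        "pref": pref,
        "lifetime": float("inf") if lifetime == INFINITE_LIFETIME else lifetime,
    }


# Example: 2001:db8::/32 with Pref bits 01 and a 600-second lifetime.
sample = (bytes([24, 2, 32, 0b00001000]) + struct.pack("!I", 600)
          + bytes.fromhex("20010db800000000"))
print(parse_route_info(sample))
```

The raw 2-bit Pref value is returned unchanged; how a host ranks the remaining
values is left to the routing logic and is not modeled here.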

8.5.6.18 Recursive DNS Server Option (RDNSS) (Type 25) 

The Recursive DNS Server (RDNSS) option, defined in [RFC6106], can be used
with RA messages to enhance stateless autoconfiguration by providing the IPv6
address of one or more DNS servers (see Chapters 6 and 11). Multiple RDNSS
options may be included with an RA message. The format is given in Figure 8-59.

The Lifetime field gives the amount of time in seconds during which the list
of DNS server addresses should be considered valid. The all-1s value indicates
an infinite lifetime. If different lifetimes are required, multiple distinct RDNSS
options may be included in the same RA message.

8.5.6.19 Router Advertisement Flags Extension Option (EFO) (Type 26) 

This option extends the Flags field used in RA messages [RFC5175]. It is also
sometimes called the Expanded Flags option (EFO). Its format is given in Figure 8-60.

The Length field is currently defined to be 1 until the subsequent bits are
allocated.

Figure 8-59 The Recursive DNS Server option indicates the IPv6 address(es) of one or more DNS 
servers capable of performing recursive lookups (see Chapter 11). 


0                15 16               31
Type (26) | Length | Bit Fields Available for Assignment (variable)


Figure 8-60 The Router Advertisement Expanded Flags option provides an arbitrary amount of 
additional space for defining future RA flags. 

8.5.6.20 Handover Key Request Option (Type 27)

The Handover Key Request option is used with FMIPv6 messages that use SEND 
to secure signaling information [RFC5269]. Its format is given in Figure 8-61. 


0                15 16               31
Type (27) | Length | Pad Length | AT (4 bits) | Resv (0) (4 bits)
Handover Key Encryption Public Key (variable)
Padding (0) (variable)


Figure 8-61 The Handover Key Request option is used with FMIPv6 signaling secured by SEND 
and provides CGA parameters including a public key. A router uses this information in 
forming a handoff key that is provided encrypted for a mobile node. 


The Pad Length field gives the number of 0 padding bytes included at the end 
of the option (included in the Length field). The Algorithm Type (AT) field identifies 
the algorithm used to compute the authenticator (see [RFC5568]). The Handover 
Key Encryption Public Key field encodes the FMIPv6 CGA public key in the same 
format used with the CGA option. The Padding area contains bytes with value 0 to 
ensure that the option is a multiple of 8 bytes. 

8.5.6.21 Handover Key Reply Option (Type 28) 

This option is used with FMIPv6 messages that use SEND to secure signaling 
information [RFC5269]. Its format is given in Figure 8-62. 

The Pad Length and AT fields are as given with the Handover Key Request
option. The Key Lifetime field gives the number of seconds for which the
handover key is valid (the default is HK-LIFETIME or 43,200s). The Encrypted
Handover Key field holds a symmetric key (see Chapter 18) encrypted using the mobile
node's handover key encryption key. The encoding format is RSAES-PKCS1-v1_5
[RFC3447]. The Padding field contains bytes with value 0 to ensure that the option
is a multiple of 8 bytes.

8.5.6.22 DNS Search List Option (DNSSL) (Type 31) 

The DNS Search List (DNSSL) option [RFC6106] is used to indicate a list of domain 
name extensions to be added to DNS queries a host might issue. Search lists are 
part of the DNS configuration information that may be provided to a host when it is 
initialized (see Chapter 6). The format of the DNSSL option is shown in Figure 8-63. 


Figure 8-62 The Handover Key Reply option is used with FMIPv6 signaling secured by SEND and 
provides a symmetric handoff key encrypted using the mobile node's public key. Only 
the correct mobile node possessing the corresponding private key can decrypt the 
option to recover the key. 


0                15 16               31
Domain Names of DNS Search List (variable, using [RFC1035] format)


Figure 8-63 The DNS Search List option provides a list of default domain name extensions used 
when configuring a host's DNS parameters. The encoding format is the same one used 
for encoding DNS names (see Chapter 11). 


The Lifetime field indicates how many seconds from the time the message is 
sent that the domain search list should be considered valid. The domain name 
search list includes a list of (uncompressed) domain name extensions used as a 
form of default for forming FQDNs from partial strings (see Chapter 11). 
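
To illustrate the uncompressed [RFC1035] name format used for the search list,
the following Python sketch encodes each domain name as a sequence of
length-prefixed labels ending in a zero byte. The function names are illustrative,
and the sketch covers only the domain-name portion of the option; the option
header and any trailing padding are not modeled.

```python
def encode_domain_name(name: str) -> bytes:
    """Encode one domain name as uncompressed length-prefixed labels."""
    encoded = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 1 <= len(raw) <= 63:
            raise ValueError("each label must be 1 to 63 bytes long")
        encoded.append(len(raw))  # one-byte length prefix
        encoded += raw            # the label itself
    encoded.append(0)             # zero-length root label ends the name
    return bytes(encoded)


def encode_search_list(names) -> bytes:
    """Concatenate the encodings of every name in a search list."""
    return b"".join(encode_domain_name(name) for name in names)


print(encode_search_list(["example.com", "lab.example.com"]))
```

Because compression is not used, a shared suffix such as example.com is simply
repeated in full for each entry.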

8.5.6.23 Experimental Values (Types 253, 254) 

These values are used only for experimentation, as described in [RFC3692]. 

8.6 Translating ICMPv4 and ICMPv6

In Chapter 7 we discussed a framework for IPv4/IPv6 translation based on
[RFC6144] and [RFC6145] and discussed how IP headers are translated. The
methods used to translate ICMPv4 to ICMPv6 and vice versa are also described in
[RFC6145]. When translating ICMP, both the IP and ICMP headers are translated
(i.e., modified and replaced). In addition, ICMP error messages, which contain
an internal offending packet header and data, have the internal (offending)
datagram's headers translated. Aside from mapping the appropriate type and code
numbers, there are additional concerns regarding fragmentation, MTU sizes, and
checksum computations. Recall that ICMPv6 uses a pseudo-header checksum
covering information at the network layer, whereas the ICMPv4 checksum is
computed only over ICMPv4 information.
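
The difference between the two checksum computations can be sketched as follows.
This is an illustration rather than translator code: the function names are made
up for the example, the message passed in is assumed to have its Checksum field
set to 0, and the pseudo-header layout (source address, destination address,
32-bit upper-layer length, three zero bytes, and Next Header value 58) is the
standard IPv6 one.

```python
import ipaddress
import struct

ICMPV6_NEXT_HEADER = 58


def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones' complement of the ones' complement sum."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:   # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def icmpv4_checksum(icmp_message: bytes) -> int:
    """ICMPv4: the checksum covers the ICMP message only."""
    return internet_checksum(icmp_message)


def icmpv6_checksum(src: str, dst: str, icmp_message: bytes) -> int:
    """ICMPv6: the checksum also covers a pseudo-header of IPv6 fields."""
    pseudo_header = (
        ipaddress.IPv6Address(src).packed
        + ipaddress.IPv6Address(dst).packed
        + struct.pack("!I3xB", len(icmp_message), ICMPV6_NEXT_HEADER)
    )
    return internet_checksum(pseudo_header + icmp_message)


# Echo Request bodies: identifier 1, sequence number 1, checksum zeroed.
echo_v4 = bytes([8, 0, 0, 0, 0, 1, 0, 1])
echo_v6 = bytes([128, 0, 0, 0, 0, 1, 0, 1])
print(hex(icmpv4_checksum(echo_v4)))
print(hex(icmpv6_checksum("2001:db8::1", "2001:db8::2", echo_v6)))
```

Because the ICMPv6 result depends on the IPv6 addresses, a translator cannot
carry a checksum across unchanged; it has to be recomputed or adjusted whenever
the message moves between the two protocols.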

8.6.1 Translating ICMPv4 to ICMPv6

When translating ICMPv4 informational messages to ICMPv6, only the Echo
Request and Echo Reply types are translated. To perform the translation, the type
values (8 and 0) are translated to values 128 and 129, respectively. After this
translation, the ICMPv6 pseudo-header checksum is computed and applied. When
translating ICMPv4 error messages, only the following error messages are
translated: Destination Unreachable (type 3), Time Exceeded (type 11), and Parameter
Problem (type 12). Table 8-6 gives the type and code value mappings used to
perform translation. Types and codes not shown are not translated, and the arriving
packet that would have been translated is instead dropped.


Table 8-6 Type and code mappings used to translate ICMPv4 error messages to ICMPv6

ICMPv4 Type/Code | ICMPv4 Descriptive Name | ICMPv6 Type/Code | ICMPv6 Descriptive Name (Note)
3/0 | Destination Unreachable—Network | 1/0 | Destination Unreachable—No Route
3/1 | Destination Unreachable—Host | 1/0 | Destination Unreachable—No Route
3/2 | Destination Unreachable—Protocol | 4/1 | Parameter Problem—Unrecognized Next Header (set Pointer to indicate Next Header)
3/3 | Destination Unreachable—Port | 1/4 | Destination Unreachable—Port
3/4 | Destination Unreachable—Fragmentation Required (PTB) | 2/0 | PTB (adjust MTU field to reflect size of larger IPv6 header)
3/5 | Destination Unreachable—Source Route Failed | 1/0 | Destination Unreachable—No Route (unlikely to occur)
3/(6,7) | Destination Unreachable—Destination Network/Host Unknown | 1/0 | Destination Unreachable—No Route
3/8 | Destination Unreachable—Source Host Isolated | 1/0 | Destination Unreachable—No Route
3/(9,10) | Destination Unreachable—Destination Network/Host Administratively Prohibited | 1/1 | Destination Unreachable—Communication with Destination Administratively Prohibited
3/(11,12) | Destination Unreachable—ToS Unavailable | 1/0 | Destination Unreachable—No Route
3/13 | Destination Unreachable—Administratively Prohibited | 1/1 | Destination Unreachable—Communication with Destination Administratively Prohibited
3/14 | Destination Unreachable—Host Precedence Violation | N/A | (Drop)
3/15 | Destination Unreachable—Precedence Cutoff in Effect | 1/1 | Destination Unreachable—Communication with Destination Administratively Prohibited
11/(0,1) | Time Exceeded—TTL, Fragment Reassembly | 3/(0,1) | Time Exceeded (code remains unchanged)
12/0 | Parameter Problem—Pointer Contains Byte Offset of Error | 4/0 | Parameter Problem—Erroneous Header Field Encountered (update Pointer as in Table 8-7)
12/1 | Parameter Problem—Missing Option | N/A | (Drop)
12/2 | Parameter Problem—Bad Length | 4/0 | Parameter Problem—Erroneous Header Field Encountered (update Pointer as in Table 8-7)

As shown in Table 8-6, for Parameter Problem messages where the Pointer 
field gives the byte offset of the problem, an additional mapping is used to form 
the appropriate value for the IPv6 Pointer field. Table 8-7 gives this mapping. 
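
The type and code mappings of Table 8-6, together with the Echo Request/Reply
mapping given earlier, lend themselves to a simple lookup table. The Python below
is a sketch with illustrative names; it returns None for any combination that is
not translated, leaving the decision to drop the packet to the caller, and it
does not perform the Pointer or MTU adjustments noted in the table.

```python
# (ICMPv4 type, code) -> (ICMPv6 type, code), following Table 8-6.
ICMPV4_TO_ICMPV6 = {
    (8, 0): (128, 0),   # Echo Request
    (0, 0): (129, 0),   # Echo Reply
    (3, 0): (1, 0),     # Network unreachable -> No Route
    (3, 1): (1, 0),     # Host unreachable -> No Route
    (3, 2): (4, 1),     # Protocol unreachable -> Unrecognized Next Header
    (3, 3): (1, 4),     # Port unreachable
    (3, 4): (2, 0),     # Fragmentation Required -> PTB
    (3, 5): (1, 0),     # Source Route Failed
    (3, 6): (1, 0),     # Destination Network Unknown
    (3, 7): (1, 0),     # Destination Host Unknown
    (3, 8): (1, 0),     # Source Host Isolated
    (3, 9): (1, 1),     # Network Administratively Prohibited
    (3, 10): (1, 1),    # Host Administratively Prohibited
    (3, 11): (1, 0),    # ToS Unavailable (network)
    (3, 12): (1, 0),    # ToS Unavailable (host)
    (3, 13): (1, 1),    # Administratively Prohibited
    (3, 15): (1, 1),    # Precedence Cutoff in Effect
    (11, 0): (3, 0),    # Time Exceeded, TTL
    (11, 1): (3, 1),    # Time Exceeded, Fragment Reassembly
    (12, 0): (4, 0),    # Parameter Problem, Pointer indicates error
    (12, 2): (4, 0),    # Parameter Problem, Bad Length
}


def translate_type_code(icmpv4_type: int, icmpv4_code: int):
    """Return the ICMPv6 (type, code) pair, or None if not translated."""
    return ICMPV4_TO_ICMPV6.get((icmpv4_type, icmpv4_code))


print(translate_type_code(3, 4))    # Fragmentation Required
print(translate_type_code(3, 14))   # Host Precedence Violation
```

Entries 3/14 and 12/1 are deliberately absent, so they fall through to None in
the same way as any type not listed in the table.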

In addition to performing the header translations, the offending datagram
carried in an ICMPv4 error message is also translated according to the rules for
IPv4/IPv6 translation. Note that this implies the resulting ICMPv6 datagram may
be of a significantly different size from what it would be if the internal translation
were not performed. The Payload Length field in the base IPv6 header is updated to
reflect any such effects. Note that only a single level of such inner translation is
supported. If one or more additional internal headers are discovered, the packet
being translated is discarded. Generally, packets other than ICMP messages
failing translation result in an ICMPv4 Destination Unreachable—Communication
Administratively Prohibited (code 13) message being sent to the sender of the
failed packet.

Table 8-7 Pointer field mappings used when translating ICMPv4 Parameter Problem messages to ICMPv6

IPv4 Pointer Value | IPv4 Header Field | IPv6 Pointer Value | IPv6 Header Field
0 | Version/IHL | 0 | Version/DS Field/ECN (Traffic Class)
1 | DS Field/ECN (ToS) | 1 | DS Field/ECN (Traffic Class)/Flow Label
2,3 | Total Length | 4 | Payload Length
4,5 | Identification | N/A |
6 | Flags/Fragment Offset | N/A |
7 | Fragment Offset | N/A |
8 | Time to Live | 7 | Hop Limit
9 | Protocol | 6 | Next Header
10,11 | Header Checksum | N/A |
12-15 | Source IP Address | 8 | Source IP Address
16-19 | Destination IP Address | 24 | Destination IP Address
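
The Pointer mapping of Table 8-7 can be sketched in Python as follows. The names
are illustrative, and returning None for offsets marked N/A is a choice made for
this example; what a translator does with such a message is not modeled here.

```python
def build_pointer_map() -> dict:
    """Build the IPv4-to-IPv6 Pointer mapping given in Table 8-7."""
    mapping = {0: 0, 1: 1, 2: 4, 3: 4, 8: 7, 9: 6}
    for offset in range(12, 16):   # IPv4 Source IP Address bytes
        mapping[offset] = 8
    for offset in range(16, 20):   # IPv4 Destination IP Address bytes
        mapping[offset] = 24
    return mapping


V4_TO_V6_POINTER = build_pointer_map()


def translate_pointer(ipv4_pointer: int):
    """Return the IPv6 Pointer value, or None where the table shows N/A."""
    return V4_TO_V6_POINTER.get(ipv4_pointer)


print(translate_pointer(9))   # Protocol field maps to Next Header
print(translate_pointer(5))   # Identification has no IPv6 counterpart
```

Every byte of an IPv4 address maps to the first byte of the corresponding IPv6
address field, so the translated Pointer identifies the field rather than the
exact byte within it.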


Note that as with other IPv4 traffic being translated to IPv6 (see Chapter 7),
packets arriving with the DF bit field not set result in one or more IPv6 packets
with Fragment headers included and resulting fragments not exceeding the IPv6
minimum MTU. This is to deal with the issue that IPv4 routers are permitted to
fragment IPv4 traffic (including ICMPv4 traffic) but IPv6 routers are not. ICMPv4
PTB messages may need to be translated to ICMPv6 PTB messages that contain an
MTU less than the IPv6 minimum link MTU of 1280 bytes. A properly operating
IPv6 stack processes all such messages and sends subsequent datagrams to the
same destination equipped with Fragment headers.

8.6.2 Translating ICMPv6 to ICMPv4

Among ICMPv6 informational messages, Echo Request (type 128) and Echo
Reply (type 129) messages are translated to ICMPv4 Echo Request (type 8) and
Echo Reply (type 0) messages, respectively. The checksum is updated to take into
account the type value changes and the lack of the pseudo-header computation.
Other informational messages are discarded. Table 8-8 shows how error messages
are translated, giving the incoming (ICMPv6) and outgoing (ICMPv4) type and
code numbers.

Once again, the Pointer field used with the Parameter Problem message
requires special handling. Table 8-9 provides this mapping for the
ICMPv6-to-ICMPv4 case.

Note that the ICMPv4 checksum does not use a pseudo-header, so when
performing a header translation, the resulting checksum must be updated
appropriately if a non-checksum-neutral address translation is performed. In addition,
the internal IPv6 datagram may contain addresses that are not IPv4-translatable
addresses, resulting in a need for stateful translation (see Chapter 7).

Table 8-8 Type and code mappings used to translate ICMPv6 error messages to ICMPv4

ICMPv6 Type/Code | ICMPv6 Descriptive Name | ICMPv4 Type/Code | ICMPv4 Descriptive Name (Note)
1/0 | Destination Unreachable—No Route | 3/1 | Destination Unreachable—Host
1/1 | Destination Unreachable—Communication with Destination Administratively Prohibited | 3/10 | Destination Unreachable—Destination Host Administratively Prohibited
1/2 | Destination Unreachable—Beyond Scope of Source Address | 3/1 | Destination Unreachable—Host
1/3 | Destination Unreachable—Address | 3/1 | Destination Unreachable—Host
1/4 | Destination Unreachable—Port | 3/3 | Destination Unreachable—Port
2/0 | PTB (adjust MTU field to reflect size of larger IPv6 header) | 3/4 | Destination Unreachable—Fragmentation Required (PTB)
3/(0,1) | Time Exceeded—Hop Limit, Fragment Reassembly | 11/(0,1) | Time Exceeded—TTL, Fragment Reassembly (code value is unchanged)
4/0 | Parameter Problem—Erroneous Header Field Encountered | 12/0 | Parameter Problem—Pointer Contains Byte Offset of Error (update Pointer as in Table 8-9)
4/1 | Parameter Problem—Unrecognized Next Header | 3/2 | Destination Unreachable—Protocol (set Pointer to indicate Protocol field)
4/2 | Parameter Problem—Unrecognized IPv6 Option Encountered | N/A | (Drop)


Table 8-9 Pointer field mappings used when translating ICMPv6 Parameter Problem messages to ICMPv4

IPv6 Pointer Value | IPv6 Header Field | IPv4 Pointer Value | IPv4 Header Field
0 | Version/DS Field/ECN (Traffic Class) | 0 | Version/IHL/DS Field/ECN (ToS)
1 | DS Field/ECN (Traffic Class)/Flow Label | 1 | DS Field/ECN (ToS)
2,3 | Flow Label | N/A |
4,5 | Payload Length | 2 | Total Length
6 | Next Header | 9 | Protocol
7 | Hop Limit | 8 | Time to Live
8-23 | Source IP Address | 12 | Source IP Address
24-39 | Destination IP Address | 16 | Destination IP Address

When handling differences in packet sizes, recall that there is no Don't
Fragment indication in IPv6 datagrams ("don't fragment" is implicitly always true),
and routers cannot perform fragmentation. As a result, IPv6 packets arriving at
the translator that do not fit in the MTU of the IPv4 interface used to reach the
next hop are discarded and an appropriate ICMPv6 PTB message is sent back to
the IPv6 source of the offending datagram.


8.7 Attacks Involving ICMP 

The types of attacks involving ICMP fall primarily into three categories: floods,
bombs, and information disclosure. In essence, floods cause a large amount of traffic
to be generated, leading to an effective DoS attack on one or more computers. The
bomb class (sometimes called nuke class) refers to sending specially constructed
messages that cause IP or ICMP processing to crash or hang. Information
disclosure attacks do not typically cause harm by themselves but can be used to inform
the approaches used by other attack methods to avoid wasting time or avoid being
detected. ICMP attacks against TCP have been documented separately [RFC5927].

One of the early attacks involving ICMP is called the smurf attack. This amounts
to using ICMPv4 with a broadcast destination address to induce a large number of
computers to respond. If this is done rapidly, it can result in a DoS attack because
the victim computer is too busy processing the ICMP traffic to do anything else.
Generally this attack is mounted by setting the source IP address to the intended
victim's address. Thus, when the broadcast ICMP message is received by several
computers, all of them respond simultaneously to the source address in the ICMP
message (i.e., the victim's). This attack is easily handled by disallowing incoming
directed broadcast traffic at the firewall perimeter.

With ICMPv4 Echo Request/Reply (ping) messages, it is possible to construct
packet fragments in such a way that when they are reassembled, they form an
IPv4 datagram that is too large (larger than the maximum of 64KB). This has been
used to cause some systems to crash and therefore represents another form of DoS
attack. It is sometimes called the ping of death attack. A somewhat related attack
involves changing the Fragment Offset fields in IPv4 headers so as to induce errors
in the IPv4 fragment reassembly routines. This is known as the teardrop attack.

Another unanticipated situation that has been taken advantage of is the
assumption that an ICMP message would have distinct source and destination
addresses. In the land attack, an ICMP message containing a source and
destination IP address equal to the victim's is sent to the victim. Some implementations
react in unfortunate ways when receiving such a message.

The ICMP redirect capability can be used to cause an end system to use an
incorrect system as a next-hop router. Although a number of checks are made on
incoming ICMP Redirect messages in hopes of ensuring that they really
originated with the current default router, these together fail to ensure that the
message is authentic. With this attack, a man-in-the-middle (see Chapter 18) can be
inserted along the flow of traffic, which is then recorded and analyzed. In
addition, the traffic could be modified to cause unwanted actions. The attack can
achieve similar results to the ARP poisoning attack (see Chapter 4). In addition, it
has been used to cause a victim to believe that it is itself the preferred gateway to
a destination. This causes an infinite loop and a consequential lockup of the victim
computer.

The ICMP Router Advertisement and Router Solicitation messages can be 
used to create an attack that somewhat resembles the redirect attack. In 
particular, these messages can be used to induce victim systems to change their default 
routes to point to a compromised machine. In addition, passively receiving these 
messages can enable an attacker to learn about the topology of the local network 
environment. Note that the problem of such "rogue RAs," whether malicious or 
accidental, has been considered in more detail separately [RFC6104]. 

ICMP can be used as a communication channel among invading programs 
that wish to coordinate. In the TFN (Tribe Flood Network) attack, ICMP is used to 
coordinate among a group of collaborating viruses after they have compromised 
computers. 

The set of ICMP Destination Unreachable messages can be used to cause 
denial of service to currently existing connections (e.g., TCP connections). In some 
implementations, receiving a Host Unreachable, Port Unreachable, or Protocol 
Unreachable message from an IP address causes all transport-layer connections 
currently associated with that address to be closed. These attacks are sometimes 
called smack or bloop attacks. 

The ICMP Timestamp Request/Reply message (which is not used anymore in 
normal operations) can be used, if enabled, to learn the time of day according to 
some host. Because many approaches to security are based on using 
cryptography with random keys, if the source and state of randomness were to be known, 
an external actor could predict the sequence of pseudo-random numbers (that is 
why they are only pseudo-random) used for creating cryptographic keys, possibly 
allowing a third party to guess otherwise secret values and hijack connections 
(see Chapter 13 on TCP and Chapter 18's discussion of random numbers). Because 
many random numbers are based on the current time of day, revealing a host's 
precise notion of the time could be a problem. 

Yet another attack involves modification of the PTB message. Recall that this 
message contains a field indicating the recommended MTU. This is used by 
transport protocols such as TCP to pick their packet size. If an attacker modifies this 
value, it can force an endpoint TCP to run with very small packets (resulting in 
poor performance). 

Most of these attacks have been made ineffective by modifying the ICMP 
implementations present in popular operating systems. However, without 
cryptography, spoofing or masquerading attacks are still possible, in general. Protocols 
that use cryptographic methods (e.g., SEND) offer an enhanced level of security 
but may be considerably more complicated to deploy and analyze when problems 
arise. 


8.8 Summary 

In this chapter we have looked at the Internet Control Message Protocol (ICMPv4 
and ICMPv6), a required part of every IP implementation. ICMP messages are 
carried in IP datagrams and are the first messages we have discussed that carry an 
end-to-end checksum (a pseudo-header checksum in the case of ICMPv6). ICMP 
messages may be broadly divided into error and informational message types. 
Generally speaking, ICMP error messages are not generated in response to 
problematic ICMP error messages to avoid message flooding. For IP, ICMP provides a 
limited information and error-reporting capability. However, the important Echo 
Request/Reply and Time Exceeded messages are necessary to support the 
popular ping and traceroute tools. Other (less visible) uses include the Destination 
Unreachable, PTB, and Redirect messages that are necessary for proper operation 
of path MTU discovery and efficient router selection. 

We looked at the ICMP Destination Unreachable, Redirect, and Echo Request/ 
Reply messages in some detail. We also saw the fairly common ICMP Port 
Unreachable error message. This let us examine the information returned in an 
ICMP error: the IP header and as much of the IP datagram that caused the error as 
possible without causing the error message to become fragmented. This 
information is required by the receiver of the ICMP error, to know more about the cause 
of the error and to help direct the error message to the appropriate process or 
protocol implementation. There is an extension facility that can be applied to 
certain ICMP messages to carry additional information (e.g., MPLS tags or next-hop 
router information). 

ICMPv6 is far more complex, and far more important to IPv6, than ICMPv4 
is to IPv4. It is critical for the basic configuration and operation of IPv6 
systems. ICMPv6 includes most of the useful ICMPv4 messages (e.g., Destination 
Unreachable, Time Exceeded, Fragmentation Required, Echo Request/Reply) but 
also handles ND (like ARP in IPv4), allows IPv6 nodes to discover their on-link 
hosts and default routers, and provides discovery services and dynamic 
configuration for MIPv6 nodes. ICMPv6 is also used for managing multicast group 
memberships, whereas this is accomplished using the IGMP protocol for IPv4. We shall 
examine both in Chapter 9. ICMPv6 defines a rich set of options used with ND, 
some of which are required. Because ICMPv6 is used for so many host 
configuration messages that could be subject to attack, there is a secure variant (SEND) 
that allows addresses to be verified using cryptographically generated addresses 
(CGAs). CGAs are interesting in their own right and are used in protocols other 
than SEND. 


Broadcasting and Local Multicasting (IGMP and MLD) 


9.1 Introduction 

We mentioned in Chapter 2 that there are four kinds of IP addresses: unicast, 
anycast, multicast, and broadcast. IPv4 may use all of them, and IPv6 uses any except the 
last form. In this chapter we discuss broadcasting and multicasting in more detail, 
including how link-layer addressing can be used to send multicast or broadcast 
traffic efficiently from one computer to several others. We also examine the Internet 
Group Management Protocol (IGMP) [RFC3376] and the IPv6 Multicast Listener 
Discovery (MLD) [RFC3810] protocols, which are used to inform IPv4 and IPv6 
multicast routers which multicast addresses are in use on a subnetwork. One topic we 
do not cover in this chapter (or this book) is how multicast routing is implemented 
in wide area networks such as the global Internet. At the present time, multicast is 
used more in enterprise and local networks than in the wide area. While the 
protocols we discuss in this chapter are prerequisites for a complete understanding of 
wide area multicasting, the wide area routing protocols are comparatively complex 
and would unnecessarily complicate the explanation of the important local area 
case. The reader interested in exploring these issues is referred to [EGW02]. 

Broadcasting and multicasting provide two services for an application: 
delivery of packets to multiple destinations, and solicitation/discovery of servers by 
clients. 

• Delivery to multiple destinations 

There are many applications that deliver information to multiple recipients: 
interactive conferencing and dissemination of mail or news to multiple 
recipients, for example. Without broadcasting or multicasting, these types 
of services tend to use TCP today (delivering a separate copy to each 
destination, which can be very inefficient). 

• Solicitation of servers by clients 

Using broadcasting or multicasting, an application can send a request for a 
server without knowing any particular server's IP address. This capability 
is very useful during configuration when little is known about the local 
networking environment. A laptop, for example, might need to get its 
initial IP address and find its nearest router using DHCP (see Chapter 6). 

Although both broadcasting and multicasting can provide these important 
capabilities, multicasting is generally preferable to broadcasting because 
multicasting involves only those systems that support or use a particular service or protocol, 
and broadcasting does not. Thus, a broadcast request affects all hosts that are 
reachable within the scope of the broadcast, whereas multicast affects only those hosts 
that are likely to be interested in the request. These concepts will become clearer as 
we explore the details of broadcasting and multicasting. For now, keep in mind that 
there is a trade-off between the higher overhead and simplicity of broadcast and 
the improved efficiency but greater complexity associated with multicast. 

Broadcasting has been supported by the IPv4 protocol since its inception, 
and multicast was added with the publication of [RFC1112]. IPv6 supports 
multicasting but does not support broadcasting. Generally, only user applications that 
use the UDP transport protocol (Chapter 10) take advantage of broadcasting and 
multicasting, where it makes sense for an application to send a single message to 
multiple recipients. TCP is a connection-oriented protocol that implies a 
connection between two hosts (specified by IP addresses) and one process on each host 
(specified by port numbers). TCP can use unicast and anycast addresses (recall 
that anycast addresses behave like unicast addresses), but not broadcast or 
multicast addresses. 


Note 

Broadcasting and multicasting are also used by important system processes 
such as routing protocols, ARP, ND in IPv6, and others. Although IP multicasting 
support was once an “add-on,” requiring users to patch their systems to make 
use of it, modern operating systems include the capability by default. 
Multicasting is an important but arguably optional feature in IPv4, but it is mandatory in 
IPv6 because of its use in ND (see Chapter 8), a service critical even to unicast 
communication. 


9.2 Broadcasting 

Broadcasting refers to sending a message to all possible receivers in a network. In 
principle, this is simple: a router simply forwards a copy of any message it receives 
out of every interface other than the one on which the message arrived. Things are 
slightly more complicated when multiple hosts are attached to the same local area 
network. In this case, features of the link layer may be used to make broadcasting 
somewhat more efficient. 

Consider a set of hosts on a network such as an Ethernet that supports 
broadcasting at the link layer. Each Ethernet frame contains the source and destination 
MAC addresses (48-bit values). Normally, each IP packet is destined for a single 
host, so unicast addressing is used and the destination's unique MAC address is 
determined using ARP or IPv6 ND. When a frame is sent to a unicast destination 
in this way, communication between any two hosts does not bother any of the 
remaining hosts on the network. For switched Ethernet networks, these are the 
types of addresses found in the station caches in switches and bridges (see 
Chapter 3). There are times, however, when a host wants to send a frame to every other 
host on the network (or VLAN); this is called a broadcast. We saw this with ARP 
in Chapter 4. 

9.2.1 Using Broadcast Addresses 

On an Ethernet or similar network, a multicast MAC address has the low-order bit 
of the high-order byte turned on. In hexadecimal this looks like 01:00:00:00:00:00. 
We may consider the Ethernet broadcast address ff:ff:ff:ff:ff:ff as a special case of 
the Ethernet multicast address. From Chapter 2 recall that in IPv4, each subnet has 
a local subnet-directed broadcast address formed by placing all 1 bits in the host 
portion of the address, and the special address 255.255.255.255 corresponds to a 
local network (also called "limited") broadcast. 
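
The group-bit test described above is easy to express in code. The following Python sketch is an illustration only; the function name is hypothetical, and it assumes MAC addresses written as six colon-separated hexadecimal bytes.

```python
def is_group_mac(mac: str) -> bool:
    """Return True if the MAC address is a group (multicast or broadcast)
    address, i.e., the low-order bit of its high-order byte is set."""
    high_order_byte = int(mac.split(":")[0], 16)
    return bool(high_order_byte & 0x01)


for mac in ("01:00:00:00:00:00",   # smallest multicast address
            "ff:ff:ff:ff:ff:ff",   # broadcast, a special case of multicast
            "00:08:74:93:c8:3c"):  # an ordinary unicast address
    print(mac, is_group_mac(mac))
```

The broadcast address passes the same test as any other group address, which is why it can be treated as a special case of multicast.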

9.2.1.1 Example 

In Linux, the IPv4 subnet-directed broadcast address associated with each 
interface can be found or set with the ifconfig command. We can see it displayed as 
follows: 


Linux% ifconfig eth0 
eth0      Link encap:Ethernet  HWaddr 00:08:74:93:C8:3C 
          inet addr:10.0.0.13  Bcast:10.0.0.127  Mask:255.255.255.128 
          inet6 addr: 2001:5c0:9ae2:0:208:74ff:fe93:c83c/64 Scope:Global 
          inet6 addr: fe80::208:74ff:fe93:c83c/64 Scope:Link 
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 
          RX packets:426469 errors:0 dropped:0 overruns:1 frame:0 
          TX packets:779338 errors:0 dropped:0 overruns:0 carrier:0 
          collisions:298048 txqueuelen:1000 
          RX bytes:44414543 (42.3 MiB)  TX bytes:1094425223 (1.0 GiB) 
          Interrupt:19 Base address:0xec00 


Here, the address 10.0.0.127 is the (subnet-directed) broadcast address used 
on the network to which device eth0 is attached. This address is formed by 
taking the network prefix (10.0.0.0/25) and combining it with 32 - 25 = 7 bits of 1s in 
the host portion of the address: 10.0.0.0 OR 0.0.0.127 = 10.0.0.127. A simple utility 
called ipcalc is available on some systems to perform this calculation. 
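
The same calculation can be sketched in a few lines of Python using the standard ipaddress module. This is an illustration rather than a description of how ifconfig or ipcalc work internally, and the function name is hypothetical.

```python
import ipaddress


def subnet_broadcast(address_with_prefix: str) -> str:
    """Compute the IPv4 subnet-directed broadcast address by ORing the
    network prefix with all 1 bits in the host portion."""
    interface = ipaddress.IPv4Interface(address_with_prefix)
    network = interface.network
    host_bits = network.max_prefixlen - network.prefixlen   # e.g., 32 - 25 = 7
    host_all_ones = (1 << host_bits) - 1                     # e.g., 0.0.0.127
    broadcast = int(network.network_address) | host_all_ones
    return str(ipaddress.IPv4Address(broadcast))


print(subnet_broadcast("10.0.0.13/25"))   # prints 10.0.0.127
```

The module also exposes the result directly as the broadcast_address attribute of a network object; the explicit OR is shown here only to mirror the arithmetic in the text.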

To see how simple broadcasting works, we can send an ICMPv4 Echo Request 
message using the ping program to the broadcast address of 10.0.0.127 
indicated by the output of the ifconfig command: 

Linux# ping -b 10.0.0.127 
WARNING: pinging broadcast address 
PING 10.0.0.127 (10.0.0.127) 56(84) bytes of data. 
64 bytes from 10.0.0.6: icmp_seq=1 ttl=64 time=1.05 ms 
64 bytes from 10.0.0.113: icmp_seq=1 ttl=64 time=1.55 ms (DUP!) 
64 bytes from 10.0.0.120: icmp_seq=1 ttl=64 time=3.09 ms (DUP!) 

--- 10.0.0.127 ping statistics --- 
1 packets transmitted, 1 received, +2 duplicates, 0% packet loss, time 0ms 


We mentioned in Chapter 8 that in this type of broadcast, all the hosts on the 
local LAN (or VLAN) are affected. Here we receive replies from three other hosts 
on the network, and the ping program notes that more responses were received 
than the number of requests sent (the DUP! indication). To see the addresses being 
used, we can investigate the action using Wireshark (see Figure 9-1). 

Figure 9-1 An ICMPv4 Echo Request message sent to the directed broadcast address on the local 
subnetwork is encapsulated in a link-layer broadcast frame with a destination address 
of all 1s. 

The Echo Request message is sent to the address 10.0.0.127. The IPv4 
implementation determines this to be the subnet-directed broadcast address by 
consulting information in the local routing table and interface configuration information, 
and it sends the datagram using the link-layer broadcast address ff:ff:ff:ff:ff:ff, so 
no ARP request is needed to determine the MAC addresses for each destination. 
In fact, the sender is unaware of what hosts will respond until they do. It knows 
only that 10.0.0.127 is a broadcast address and that it should therefore use a 
broadcast link-layer destination address when sending. The source addresses at 
both the IP and link layers are entirely conventional unicast; multicast addresses 
are used only as destination addresses. 

In this particular example, notice that each of the responses generated is 
directed at 10.0.0.13, the unicast address of the original sender, and that each 
response includes the IPv4 address of the responder: 10.0.0.6, 10.0.0.113, and 
10.0.0.120. This is a simple example of a more general principle: 
broadcast addressing (and multicast addressing, as we shall see shortly) can be used 
to discover systems or services that are otherwise unknown. In this example, 
the outgoing broadcast ping request discovered three hosts that are willing to 
respond to broadcast Echo Request messages. 

9.2.2 Sending Broadcast Datagrams 

Generally speaking, applications using broadcast use the UDP protocol (or ICMPv4 
protocol) and invoke an ordinary set of API calls to send traffic. The only 
exception is that when invoking the API calls, a special flag (SO_BROADCAST) is used 
in some operating systems to indicate that the application really does intend to 
send broadcast datagrams. For example, in Linux, failing to use the -b flag when 
attempting to do a broadcast ping causes the following output: 

Linux% ping 10.0.0.127 

Do you want to ping broadcast? Then -b 

This error is caused because the SO_BROADCAST flag is provided through the 
API only when the -b option is provided in the command line. This helps to 
avoid accidentally generating broadcast traffic that could temporarily congest a 
network. 
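
An application can request the same permission directly through the sockets API. The following Python sketch shows one way this might look for a UDP sender. The helper names, the payload, and port 9999 are hypothetical and chosen only for illustration; whether the option is strictly required depends on the operating system.

```python
import socket


def make_broadcast_socket() -> socket.socket:
    """Create a UDP socket that is explicitly permitted to send broadcasts."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Without SO_BROADCAST, many systems refuse to send to a broadcast address.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return sock


def send_broadcast(payload: bytes, broadcast_addr: str, port: int) -> int:
    """Send one datagram to a broadcast address; returns the bytes sent."""
    with make_broadcast_socket() as sock:
        return sock.sendto(payload, (broadcast_addr, port))


# Example use on the network shown earlier (not executed here):
#   send_broadcast(b"hello", "10.0.0.127", 9999)
```

Requiring the option to be set explicitly serves the same purpose as the -b flag of ping: a program cannot generate broadcast traffic by accident.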

To determine which interfaces are used for broadcasting, the IPv4 forwarding 
table (called "routing table" here) is consulted. The following is an example of a 
Windows Vista routing table (later versions of Windows use an identical format) 
showing the interface list and broadcast-related routing information (other 
information has been removed for clarity): 


C:\> netstat -rn 

Interface List 
 10 ...02 00 4c 4f 4f 50 ...... Microsoft Loopback Adapter 
  9 ...00 13 02 20 b9 18 ...... Intel(R) PRO/Wireless 3945ABG Network Connection 

IPv4 Route Table 

Active Routes: 
Network Destination          Netmask    Gateway       Interface  Metric 
            0.0.0.0          0.0.0.0   10.0.0.1       10.0.0.57      25 
         10.0.0.127  255.255.255.255    On-link       10.0.0.57     281 
    127.255.255.255  255.255.255.255    On-link       127.0.0.1     306 
    169.254.255.255  255.255.255.255    On-link  169.254.57.240     286 
    255.255.255.255  255.255.255.255    On-link       127.0.0.1     306 
    255.255.255.255  255.255.255.255    On-link  169.254.57.240     286 
    255.255.255.255  255.255.255.255    On-link       10.0.0.57     281 


The first portion of this output shows seven different network interfaces that 
may be used for carrying network traffic. The first is the virtual loopback 
interface, the next is a Wi-Fi wireless interface, the third is a wired Ethernet interface 
(that is disconnected), the fourth is another loopback interface, and the next three 
are used as part of the nonstandard Intra-Site Automatic Tunnel Addressing 
Protocol (ISATAP) [RFC5214][RFC5579]. ISATAP is used in supporting IPv6 hosts 
separated by an IPv4 network. 

Moving on to the routing table, we see that there are seven entries that could 
be used to determine where broadcast traffic should be sent. The first entry is the 
default route (mask 0.0.0.0), so it matches any destination. This could be used 
by broadcasts directed beyond the local network, if such a facility were enabled. 
This type of directed broadcast, which travels beyond the local network, is 
usually disabled by routers to avoid a number of security problems, as suggested by 
[RFC2644]. 

The next three entries are the directed subnet broadcast addresses associated 
with the three interfaces having IPv4 addresses 10.0.0.57, 127.0.0.1, and 
169.254.57.240, respectively. The last two are software loopback interfaces. 
These entries show how Windows expresses a directed subnet broadcast route as 
the network prefix combined with all 1 bits in the host part as the destination, 
and a /32 or 255.255.255.255 subnet mask. The Gateway column indicates 
On-link, so traffic is delivered using direct delivery (see Chapter 5) on the interface 
identified in the Interface column. In these cases, there is not more than one 
match for each subnet-directed broadcast address, so the Metric column is not 
consulted. 

The last three entries are routing entries for the limited broadcast address, 
255.255.255.255. In some ways, this address acts like a multicast address because 
network. Thus, it is not immediately obvious which interface(s) should be used for 
sending traffic destined for the limited broadcast address. Unfortunately, Section 
3.3.6 of the Host Requirements RFC [RFC1122] provides little guidance: 

There has been discussion on whether a datagram addressed to the Limited 

Broadcast address ought to be sent from all the interfaces of a multihomed host. 

This specification takes no stand on the issue. 

As a consequence, the way outgoing traffic to the limited broadcast address is 
handled is operating-system-specific. Most systems pick a single broadcast-capable 
interface to use for sending such traffic. Linux and FreeBSD behave this way. 
FreeBSD actually converts the limited broadcast address into a subnet-directed 
broadcast address of the "primary" (first configured) interface, although an 
application can disable this behavior using the IP_ONESBCAST API option. Windows, 
for example, has behaved differently in different versions. Up to Windows 2000, 
limited broadcasts were forwarded over multiple interfaces. With Windows XP 
and later, the default behavior is to send over a single interface. In this example, 
there are multiple possible matching routes for such traffic, so the entry with the 
lowest metric (interface 10.0.0.57) is used. 
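
The selection rule just described (among several matching routes, prefer the lowest metric) can be sketched as follows. This is a simplified Python illustration using the rows of the routing table shown earlier; the Route type and the pick_interface function are hypothetical names, and the sketch does not model everything a real forwarding lookup does (for example, longest-prefix matching).

```python
from typing import List, NamedTuple


class Route(NamedTuple):
    destination: str
    netmask: str
    interface: str
    metric: int


# Broadcast-related /32 entries copied from the routing table in the text.
ROUTES: List[Route] = [
    Route("10.0.0.127", "255.255.255.255", "10.0.0.57", 281),
    Route("127.255.255.255", "255.255.255.255", "127.0.0.1", 306),
    Route("169.254.255.255", "255.255.255.255", "169.254.57.240", 286),
    Route("255.255.255.255", "255.255.255.255", "127.0.0.1", 306),
    Route("255.255.255.255", "255.255.255.255", "169.254.57.240", 286),
    Route("255.255.255.255", "255.255.255.255", "10.0.0.57", 281),
]


def pick_interface(destination: str, routes: List[Route]) -> str:
    """Among host routes that match the destination exactly, return the
    interface of the one with the lowest metric."""
    matches = [r for r in routes if r.destination == destination]
    if not matches:
        raise LookupError(f"no route for {destination}")
    return min(matches, key=lambda r: r.metric).interface


print(pick_interface("255.255.255.255", ROUTES))   # prints 10.0.0.57
```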


9.3 Multicasting 

To reduce the amount of overhead involved in broadcasting, it is possible to send 
traffic only to those receivers that are interested in it. This is called multicasting. 
Fundamentally, this is accomplished by either having the sender indicate the 
receivers, or instead having the receivers independently indicate their interest. 
The network then becomes responsible for sending traffic only to intended/interested 
recipients. Implementing multicast is considerably more challenging than 
broadcast because multicast state (information) must be maintained by hosts and 
routers as to what traffic is of interest to what receivers. In the TCP/IP model of 
multicasting, receivers indicate their interest in what traffic they wish to receive 
by specifying a multicast address and optional list of sources. This information is 
maintained as soft state (see Chapter 4) within hosts and routers, meaning that it 
must be updated regularly or it will time out and be deleted. When this happens, 
delivery of multicast traffic either ceases or reverts to broadcast. 

The inefficiencies of broadcast apply not only to wide area networks, where 
they can be extremely severe, but also to local area and enterprise networks. Every 
host that can be reached on the same LAN or VLAN must process broadcast 
packets. IP multicasting provides a more efficient way to carry out the same types of 
tasks. If IP multicasting is used properly, only those hosts involved or interested in 
the communication need to process the associated packets, traffic is carried only 
on those links where it will be used, and only one copy of any multicast datagram 
is carried on any such link. To make multicasting work, applications that wish 
to be involved in a communication require a mechanism to notify their protocol 
implementations of their desires. The host software can then arrange to receive 
packets matching the applications' criteria. 
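
On systems with a Berkeley sockets API, one common form of this notification is the IP_ADD_MEMBERSHIP socket option. The Python sketch below is an assumption-laden illustration: the helper names are hypothetical, the join is an any-source (ASM) join, and 0.0.0.0 is used to let the system choose the interface.

```python
import socket
import struct


def build_mreq(group: str, local_ip: str = "0.0.0.0") -> bytes:
    """Pack the membership request: 4 bytes of group address followed by
    4 bytes of local interface address (0.0.0.0 lets the system choose)."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(local_ip))


def join_group(sock: socket.socket, group: str, local_ip: str = "0.0.0.0") -> None:
    """Ask the local IP implementation to accept traffic sent to the group."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_mreq(group, local_ip))


# Example use (not executed here; requires a multicast-capable interface):
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   sock.bind(("", 5353))
#   join_group(sock, "224.0.0.251")
```

Once the option is set, the host software is responsible for arranging reception of matching packets; the application simply reads from the socket.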

IP multicasting originated using a design based on the way group 
addressing works in link-layer networks such as Ethernet. In this approach, each station 
selects the group address for which it is willing to accept traffic, irrespective of 
the sender. This approach is also sometimes called any-source multicast (ASM) 
because of the insensitivity to the identity of the sender. As IP multicasting has 
evolved, an alternative form that is sensitive to the identity of the sender called 
source-specific multicast (SSM) [RFC4607] has been developed that allows end 
stations to explicitly include or exclude traffic sent to a multicast group from a 
particular set of senders. The SSM service model is easier to implement than ASM, 
primarily because in wide area multicasting it is easier to determine the location 
of a single source than the locations of many sources. In the local area, however, 
much of the machinery involved in supporting either ASM or SSM is identical, so 
we treat them together and explain the few differences when they are important. 
We begin by investigating how IP multicast traffic makes use of MAC-layer 
multicast addresses on multicast-capable IEEE LAN technology. 

9.3.1 Converting IP Multicast Addresses to 802 MAC/Ethernet Addresses 

When using unicast addresses on Ethernet-like networks, ARP (see Chapter 4) 
is usually used to determine a local destination's MAC address given its IPv4 
address. In IPv6, ND serves a similar role (see Chapter 8). When we looked at 
broadcasting earlier, we noticed that there is a single well-known broadcast MAC 
address that can always be used to reach all stations on a LAN or VLAN. What 
destination MAC address should be placed in a link-layer frame when we wish 
to send multicast traffic? Ideally, we would not have to use a protocol message 
to determine the appropriate MAC address but could instead simply map an IP 
multicast address directly to some corresponding MAC address. To see how this 
is done, we shall focus on IEEE 802 networks, especially Ethernet and Wi-Fi. These 
networks represent the most common types of networks where IP multicasting is 
used. We will first discuss how the mapping works with IPv4, and then move on 
to the slightly different method used with IPv6. 

To carry IP multicast efficiently on a link-layer network, there should be a 
one-to-one mapping between packets and addresses at the IP layer and frames at the 
link layer. The IANA owns the IEEE Organizationally Unique Identifier 
(abbreviated OUI, or more informally Ethernet address prefix) 00:00:5e. With it, IANA 
is given the right to use group (multicast) MAC addresses starting with 01:00:5e 
as well as unicast addresses starting with 00:00:5e. This prefix is used as the 
high-order 24 bits of the Ethernet address, meaning that this block includes 
unicast addresses in the range 00:00:5e:00:00:00 through 00:00:5e:ff:ff:ff and group 
addresses in the range 01:00:5e:00:00:00 through 01:00:5e:ff:ff:ff. Other 
organizations besides IANA own address blocks as well, but only IANA devotes some of 
its space to support of IP multicasting. 

The IANA allocates half of its group block to identifying IPv4 multicast traffic 
on IEEE 802 LANs. This means that the Ethernet addresses corresponding to IPv4 
multicasting are in the range 01:00:5e:00:00:00 through 01:00:5e:7f:ff:ff. 


Note 

Our notation here uses the Internet standard bit order as the bits appear in 
memory. This is what most programmers and system administrators deal with. The 
IEEE documentation uses the transmission order of the bits. 


The mapping of IPv4 addresses to their corresponding IEEE 802-style 
link-layer addresses can be seen in Figure 9-2. 

Example: 224.0.1.17 → 01:00:5e:00:01:11 

Figure 9-2 The IPv4-to-IEEE-802 MAC multicast address mapping uses the lower-order 23 bits of 
the IPv4 group address as the suffix of a MAC address starting with 01:00:5e. Because 
only 23 of the 28 group address bits are used, 32 groups are mapped to the same MAC- 
layer address. 


Recall from Chapter 2 that all IPv4 multicast addresses are contained within 
the address space from 224.0.0.0 to 239.255.255.255 (formerly known as class D 
address space). All such addresses share a common 4-bit sequence of 1110 in the 
high-order bits. Thus, there are 32 - 4 = 28 bits available to encode the entire space 
of 2^28 = 268,435,456 multicast IPv4 addresses (also called group IDs). For IPv4, the 
IANA policy of allocating half of its group addresses for use in supporting IPv4 
multicast means that all 268,435,456 IPv4 multicast group IDs need to be mapped 
into a link-layer address space containing only 2^23 = 8,388,608 unique entries. The 
mapping therefore is nonunique. That is, more than one IPv4 group ID is mapped 
to the same MAC-layer group address. Specifically, 2^28 / 2^23 = 2^5 = 32 distinct IPv4 
multicast group IDs are mapped to each group address. For example, both the 
multicast addresses 224.128.64.32 (hexadecimal e0.80.40.20) and 224.0.64.32 
(hexadecimal e0.00.40.20) are mapped into the Ethernet address 01:00:5e:00:40:20. 
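
The IPv4 mapping can be written down compactly. The following Python sketch is illustrative only (the function name is hypothetical): it keeps the low-order 23 bits of the group address and places them after the 01:00:5e prefix.

```python
import ipaddress


def ipv4_multicast_to_mac(group: str) -> str:
    """Map an IPv4 multicast group address to its IEEE 802 MAC address."""
    address = ipaddress.IPv4Address(group)
    if not address.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low_23_bits = int(address) & 0x7FFFFF          # discard the upper 9 bits
    mac = (0x01005E << 24) | low_23_bits           # prefix 01:00:5e, then 23 bits
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


print(ipv4_multicast_to_mac("224.0.1.17"))      # prints 01:00:5e:00:01:11
print(ipv4_multicast_to_mac("224.128.64.32"))   # prints 01:00:5e:00:40:20
print(ipv4_multicast_to_mac("224.0.64.32"))     # prints 01:00:5e:00:40:20
```

The last two calls return the same MAC address, which demonstrates the nonunique mapping: the 5 discarded group bits account for the 32 groups that share each link-layer address.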

For IPv6, the 16-bit OUI hexadecimal prefix is 33:33. This means that the last 32 
bits of the IPv6 address can be used to form the link-layer address. Thus, any address 
ending with the same 32 bits maps to the same MAC address (see Figure 9-3). Given 
that all IPv6 multicast addresses begin with ff, and the subsequent 8 bits are used 
for flags and scope information, this leaves 128 - 16 = 112 bits for representing 2^112 
groups. Thus, with the 32 bits of MAC-layer address available to encode these groups, 
there can be as many as 2^112 / 2^32 = 2^80 groups that map to the same MAC address! 


[Figure: IPv6 multicast address layout: 11111111, Flags (4 bits), Scope (4 bits), 
and the lower 112 bits of the group address; the low-order 32 bits are copied into 
an IEEE 802 MAC address after the 33:33 prefix.] 

Example: ff02::1:ff68:12cb → 33:33:ff:68:12:cb 

Figure 9-3 The IPv6-to-IEEE-802 MAC multicast address mapping uses the low-order 32 bits of the 
IPv6 multicast address as the suffix of a MAC address starting with 33:33. Because only 
32 of the 112 multicast address bits are used, 2^80 groups are mapped to the same MAC- 
layer address. 
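
The IPv6 mapping is simpler still, because the low-order 32 bits are copied unchanged. A minimal Python sketch, with a hypothetical function name, follows.

```python
import ipaddress


def ipv6_multicast_to_mac(group: str) -> str:
    """Map an IPv6 multicast address to its IEEE 802 MAC address by copying
    the low-order 32 bits after the 33:33 prefix."""
    address = ipaddress.IPv6Address(group)
    if not address.is_multicast:
        raise ValueError(f"{group} is not an IPv6 multicast address")
    low_32_bits = int(address) & 0xFFFFFFFF
    octets = [0x33, 0x33] + list(low_32_bits.to_bytes(4, "big"))
    return ":".join(f"{octet:02x}" for octet in octets)


print(ipv6_multicast_to_mac("ff02::1:ff68:12cb"))   # prints 33:33:ff:68:12:cb
print(ipv6_multicast_to_mac("ff02::fb"))            # prints 33:33:00:00:00:fb
```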


9.3.2 Examples 

In a previous example, we used a subnet broadcast address to determine all the 
hosts on the local subnet that would respond to a broadcast ICMPv4 Echo Request 
message. Here, because we can use multicast addressing to determine hosts that 
offer a particular service, we can send an ICMPv4 echo request to those hosts that 
respond to the Multicast DNS (mDNS [CK11]) address 224.0.0.251: 


Linux% ping 224.0.0.251 
PING 224.0.0.251 (224.0.0.251) 56(84) bytes of data. 
64 bytes from 10.0.0.2: icmp_seq=1 ttl=60 time=1.10 ms 
64 bytes from 10.0.0.11: icmp_seq=1 ttl=60 time=1.60 ms (DUP!) 
64 bytes from 10.0.0.120: icmp_seq=1 ttl=64 time=2.59 ms (DUP!) 

--- 224.0.0.251 ping statistics --- 
1 packets transmitted, 1 received, +2 duplicates, 0% packet loss, time 0ms 
rtt min/avg/max/mdev = 1.109/1.767/2.590/0.615 ms 

Here, hosts 10.0.0.2, 10.0.0.11, and 10.0.0.120 all respond, indicating that 
they are subscribed to the mDNS group. Notice that these hosts are not the same 
ones that responded when we used the broadcast address of 10.0.0.127. This is not 
so surprising, as not all hosts support the mDNS protocol. 


Note 

Multicast DNS (mDNS) is a service designed to support zero configuration 
(effortless system and device configuration). mDNS has been supported on Apple 
systems where it is part of Bonjour. Microsoft has promoted an alternative protocol that 
includes similar features known as Link Local Multicast Name Resolution (LLMNR) 
[RFC4795]. Neither protocol is currently an Internet standard within the IETF, but at 
present mDNS enjoys a longer history than LLMNR. See Chapter 11 for more details. 


For IPv6, we can perform a similar operation using an ICMPv6 Echo Request 
message: 


Linux% ping6 -I eth0 ff02::fb 
PING ff02::fb(ff02::fb) from fe80::208:74ff:fe93:c83c eth0: 56 data bytes 
64 bytes from fe80::217:f2ff:fee7:6d91: icmp_seq=1 ttl=64 time=2.76 ms 

--- ff02::fb ping statistics --- 
1 packets transmitted, 1 received, 0% packet loss, time 0ms 
rtt min/avg/max/mdev = 2.768/2.768/2.768/0.000 ms 

Note that in this case, we provide the outgoing interface as input to the ping6 
program. This allows the program to select the appropriate outgoing IPv6 address. 
As we can see in Figure 9-4, the address selected is a link-local 
address associated with the eth0 device. 


[Figure: Wireshark capture ping6-mdns.tr showing two packets: an ICMPv6 Echo
Request from fe80::208:74ff:fe93:c83c to ff02::fb (id=0x1d47, seq=1; Ethernet
source 00:08:74:93:c8:3c, Ethernet destination 33:33:00:00:00:fb, Hop Limit 64,
Payload Length 64) and the Echo Reply from fe80::217:f2ff:fee7:6d91 to
fe80::208:74ff:fe93:c83c.]

Figure 9-4 An ICMPv6 Echo Request message is sent from a link-local unicast address associated with the
eth0 network interface to the multicast address ff02::fb. The reply includes the sender's
link-local IPv6 address.

The packets are identified as ICMPv6 Echo Request/Reply messages with the
Identifier field set to 0x1d47 and Sequence Number field set to 1. The source IPv6
addresses are link-local in all cases. The destination address of the request is the
multicast address ff02::fb, which is mapped to the MAC address 33:33:00:00:00:fb.
The Echo Reply message is sent directly to the link-local IPv6 unicast address
of the sender, fe80::208:74ff:fe93:c83c, from the responder's link-local unicast
address, fe80::217:f2ff:fee7:6d91. Note that the sender of the Echo Reply message
arranges to use a source IPv6 address of the same scope (see the discussion on
source address selection in Chapter 5, and compare Figure 9-4 with Figure 5-16).


9.3.3 Sending Multicast Datagrams 

When sending any IP packet, a decision must be made as to which source address
and interface to use. This is especially true for IPv6, where having multiple
addresses per interface is considered normal. To help determine this, we can look
at the forwarding table present in the host. In either Windows or Linux, the
netstat command can be used. Here are the IPv4 and IPv6 routing tables as output
on Windows Vista (later versions use an identical format):


C:\> netstat -rn
... interface list ...

IPv4 Route Table

Active Routes:
Network Destination          Netmask    Gateway       Interface  Metric
            0.0.0.0          0.0.0.0   10.0.0.1       10.0.0.57      25
          224.0.0.0        240.0.0.0    On-link       127.0.0.1     306
          224.0.0.0        240.0.0.0    On-link  169.254.57.240     286
          224.0.0.0        240.0.0.0    On-link       10.0.0.57     281
    255.255.255.255  255.255.255.255    On-link       127.0.0.1     306
    255.255.255.255  255.255.255.255    On-link  169.254.57.240     286
    255.255.255.255  255.255.255.255    On-link       10.0.0.57     281

Persistent Routes:
  None

IPv6 Route Table

Active Routes:
 If Metric Network Destination  Gateway
  9    281 ::/0                 fe80::204:5aff:fe9f:9e80
  1    306 ff00::/8             On-link
 10    286 ff00::/8             On-link
  9    281 ff00::/8             On-link

Persistent Routes:
  None

From this table we can see that a default route for IPv4 traffic goes to 10.0.0.1
using interface 10.0.0.57. Although this does match multicast traffic, there are
other entries that are more specific. The entries listed as 224.0.0.0/4 (subnet
mask 240.0.0.0) indicate that three different interfaces can carry outgoing
multicast traffic. The interface with the lowest metric (10.0.0.57, with metric 281) is the
most preferred, so it is used unless an application specifies otherwise. For IPv6,
all multicast addresses begin with ff, and there are no broadcast addresses, so
interfaces 1, 9, and 10 can all be used. Interface 9 (which happens to be the same
interface used for IPv4 and the default for IPv6 unicast traffic) has the lowest
metric. Additional information indicating which interfaces have which IP addresses
can be determined using the Windows command ipconfig /all.
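
The selection rule just described (prefer the most specific matching route, then the lowest metric) can be sketched in a few lines of Python. This is only an illustration of the rule applied to the IPv4 entries in the table above; the function name select_interface and the tuple layout are assumptions made for this example, not part of any operating system API, and a real host applies further policy (for example, an application may override the choice).

```python
import ipaddress

# Routes copied from the Windows IPv4 table above:
# (network destination, netmask, interface address, metric).
ROUTES = [
    ("0.0.0.0", "0.0.0.0", "10.0.0.57", 25),
    ("224.0.0.0", "240.0.0.0", "127.0.0.1", 306),
    ("224.0.0.0", "240.0.0.0", "169.254.57.240", 286),
    ("224.0.0.0", "240.0.0.0", "10.0.0.57", 281),
]


def select_interface(dest, routes=ROUTES):
    """Pick the interface of the most specific matching route.

    Ties between equally specific routes are broken by the lowest metric.
    Returns None when no route matches.
    """
    addr = ipaddress.ip_address(dest)
    candidates = []
    for network, mask, iface, metric in routes:
        net = ipaddress.ip_network(f"{network}/{mask}")
        if addr in net:
            # Negate the prefix length so that min() prefers longer prefixes.
            candidates.append((-net.prefixlen, metric, iface))
    if not candidates:
        return None
    return min(candidates)[2]


if __name__ == "__main__":
    # The mDNS group matches both 0.0.0.0/0 and 224.0.0.0/4; the /4 entries win.
    print(select_interface("224.0.0.251"))
```

For the mDNS group address, the three 224.0.0.0/4 entries are more specific than the default route, and the entry with metric 281 is chosen among them.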

The output on Linux is separate for different protocol families (such as IPv4
and IPv6). It is generated by different arguments to the netstat command, to
indicate which version of IP (or other) protocol is of interest. For IPv4, there is
nothing to show, as there is no special entry for multicast; a conventional default
route handles the multicast traffic. For IPv6, however, we can see the following:


Linux% netstat -rn -A inet6
Kernel IPv6 routing table
Destination  Next Hop  Flags Metric Ref Use Iface
ff00::/8     ::        U     256    0   0   eth0

In this case, there is no direct "next hop," so the unspecified address (::) is
listed in the table, but we can see that the outgoing interface is eth0. The Flags
column contains only U, indicating that the route is usable, but the lack of a G flag
indicates that it is an on-link route, not requiring forwarding to a router.

9.3.4 Receiving Multicast Datagrams

Fundamental to multicasting is the concept of a process joining or leaving one or
more multicast groups on a given interface on a host. (We use the term process to
mean a program being executed by the operating system, often on behalf of a user.)
Membership in a multicast group on a given interface is dynamic: it changes over
time as processes join and leave groups. In addition to joining or leaving groups,
additional methods are needed if a process wishes to specify sources it cares to
hear from or exclude. These are required parts of any API on a host that supports
multicasting. We use the qualifier "interface" because membership in a group is
associated with an interface. A process can join the same group on multiple
interfaces, multiple groups on the same interface, or any combination thereof.

9.3.4.1 Example 

It is possible to determine what multicast groups are in use on each interface using
an operating-system-specific command. In Windows, the commands are part of
the netsh package. For IPv6, this works as follows (for IPv4, replace ipv6 with ip):





C:\> netsh interface ipv6 show joins

Interface 1: Loopback Pseudo-Interface 1

Scope  References  Last  Address
0               1  Yes   ff02::c

Interface 8: Local Area Connection

Scope  References  Last  Address
0               0  Yes   ff01::1
0               0  Yes   ff02::1
0               1  Yes   ff02::c
0               1  Yes   ff02::1:3
0               1  Yes   ff02::1:ffdc:fc85


Here we can see how IPv6 uses several multicast addresses per interface. The
first interface is a loopback, local interface. The only multicast group used on it is
the link-local scoped Simple Service Discovery Protocol (SSDP) multicast address,
which we saw in Chapter 7.


Note 

SSDP is described in an (expired) Internet draft [GCLG99] authored by Microsoft 
and Hewlett-Packard. SSDP also operates on IPv4, using address 239.255.255.250 
and UDP port 1900. 


On the other network interface, the addresses ff01::1 (node-local All Nodes
address) and ff02::1 (link-local All Nodes address) show joins for all nodes, and
ff02::c shows the use of SSDP. The next address, ff02::1:3, is for support of
LLMNR, a local multicast name resolution system mentioned previously and
discussed in more detail in Chapter 11. Finally, the address ff02::1:ffdc:fc85 is
the Solicited-Node multicast address for this node, used by IPv6 ND. Recall that
in IPv6, determining a neighbor's MAC address is accomplished using multicast
ICMPv6 ND messages, as opposed to the ARP mechanism used in IPv4.
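
As an illustration of where an address such as ff02::1:ffdc:fc85 comes from, the following sketch derives a Solicited-Node multicast address from a unicast address. It relies on the standard IPv6 addressing rule (the prefix ff02::1:ff00:0/104 combined with the low-order 24 bits of the unicast address), which is not restated in this section; the function name solicited_node is chosen only for this example.

```python
import ipaddress


def solicited_node(unicast):
    """Return the Solicited-Node multicast address for an IPv6 unicast address.

    The low-order 24 bits of the unicast address are appended to the
    fixed prefix ff02::1:ff00:0/104.
    """
    low24 = int(ipaddress.IPv6Address(unicast)) & 0xFFFFFF
    prefix = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return str(ipaddress.IPv6Address(prefix | low24))


if __name__ == "__main__":
    # The link-local address used as the ping6 source earlier in this chapter.
    print(solicited_node("fe80::208:74ff:fe93:c83c"))
```

Because only 24 bits of the unicast address are used, several addresses on a link may share one Solicited-Node group, which keeps the number of groups a host must join small.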

On Linux, the netstat command displays the IP group memberships: 


Linux% netstat -gn
IPv6/IPv4 Group Memberships
Interface  RefCnt  Group
lo         1       224.0.0.1
eth1       1       224.0.0.1
lo         1       ff02::1
eth1       1       ff02::1:ff2a:1988
eth1       1       ff02::1

The output from this command includes the join information for multiple
interfaces and for both IPv4 and IPv6. In this case, we see 224.0.0.1 (All Hosts)
on both the Ethernet interface (eth1) as well as the local loopback interface (lo).
We can also see the link-local scope All Nodes bindings for each interface. Finally,
the Solicited-Node address is ff02::1:ff2a:1988.


Note 

With IP multicasting, a process may send to a multicast group without joining 
it. More commonly, processes do join the multicast groups with which they are 
interacting, and on one or more specific interfaces. There is a special option in the 
socket API (IP_MULTICAST_LOOP) to alter the way multicast traffic is handled 
among processes on the same host that are members of the same group on 
the same interface. In UNIX, this option applies to the send path, meaning that 
if the option is enabled, other processes on the same host receive the
multicast datagrams, even if they have the option disabled. Conversely, on Windows,
the option applies on the receive path, meaning that any processes enabling the 
option receive multicast traffic from other applications on the same host even if 
they have the option disabled. 
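
A minimal sketch of how a process might join a group on a specific interface and set the loopback option through the sockets API follows. It uses Python's standard socket module; the helper names make_mreq and open_receiver, and the group and port in the usage comment, are hypothetical choices for this example. As the Note explains, the effect of IP_MULTICAST_LOOP differs between UNIX-like systems and Windows, so the last setsockopt call should be read as setting the option, not as guaranteeing a particular delivery behavior.

```python
import socket


def make_mreq(group, interface_addr="0.0.0.0"):
    """Pack the IPv4 membership request: group address, then local interface address.

    An interface address of 0.0.0.0 lets the system choose the interface.
    """
    return socket.inet_aton(group) + socket.inet_aton(interface_addr)


def open_receiver(group, port, interface_addr="0.0.0.0", loop=True):
    """Create a UDP socket, join `group` on one interface, and set the loopback option."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Allow several processes on the same host to bind the same port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Join the group on the chosen interface.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    make_mreq(group, interface_addr))
    # Enable or disable local loopback of multicast datagrams.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1 if loop else 0)
    return sock


# Example use (hypothetical group and port, requires a multicast-capable interface):
#   sock = open_receiver("239.1.2.3", 5000)
#   data, sender = sock.recvfrom(1500)
```

Joining is per interface, so a process that wants the same group on two interfaces would issue the membership request twice, once with each local interface address.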


9.3.5 Host Address Filtering 

To understand how the operating system processes received multicast datagrams
for multicast groups that programs have joined, recall from Chapter 3 that filtering
takes place on each host's network interface card (NIC), each time a frame is
presented to it (e.g., by a bridge or switch) for possible reception. Figure 9-5 indicates
how this occurs.

In a typical switched Ethernet environment, broadcast and multicast frames 
are replicated on all segments within a VLAN, along a spanning tree formed 
among the switches. Such frames are delivered to the NIC on each host which 
checks the correctness of the frame (using the CRC) and makes a decision about 
whether to receive the frame and deliver it to the device driver and network stack. 
Normally the NIC receives only those frames whose destination address is either 
the hardware address of the interface or the broadcast address. However, when 
multicast frames are involved, the situation is somewhat more complicated. 

NICs tend to come in two varieties. One type performs filtering based on
the hash values of the multicast hardware addresses in which the host software
has expressed interest, which means that some unwanted frames can always get
through because of hash collisions. The other type listens for a finite table of
multicast addresses, meaning that if the host needs to receive frames destined for
more multicast addresses than can fit in the table, the NIC is put into a
"multicast-promiscuous" mode, in which case all multicast traffic is given to the host
software. Hence, both types of interfaces require that the device driver or
higher-layer software perform checking that the received frame is really wanted. Even
if the interface performs perfect multicast filtering (based on the 48-bit hardware
address), because the mapping from a multicast IPv4 or IPv6 address to a 48-bit
hardware address is not unique, filtering is still required. Despite this imperfect
address mapping and hardware filtering, multicasting is still more efficient than
broadcasting.

[Figure: an arriving packet passes through filtering at each layer and is either
delivered to a process or discarded (e.g., unknown destination IPv4 address).]

Figure 9-5 Each layer implements filtering on some portion of the received message. MAC address
filtering can take place in either software or hardware. Cheaper NICs tend to impose a
larger processing burden on software because they perform fewer functions in hardware.
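
The non-unique mapping from IP multicast addresses to 48-bit hardware addresses mentioned above can be made concrete with a short sketch. It assumes the standard Ethernet mappings (01:00:5e followed by the low-order 23 bits of an IPv4 group address, and 33:33 followed by the low-order 32 bits of an IPv6 group address); the function name multicast_mac is chosen only for this example.

```python
import ipaddress


def multicast_mac(group):
    """Map an IPv4 or IPv6 multicast address to its Ethernet multicast address."""
    addr = ipaddress.ip_address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    value = int(addr)
    if addr.version == 4:
        low = value & 0x7FFFFF  # only the low-order 23 bits survive
        octets = [0x01, 0x00, 0x5E, low >> 16, (low >> 8) & 0xFF, low & 0xFF]
    else:
        low = value & 0xFFFFFFFF  # only the low-order 32 bits survive
        octets = [0x33, 0x33, (low >> 24) & 0xFF, (low >> 16) & 0xFF,
                  (low >> 8) & 0xFF, low & 0xFF]
    return ":".join(f"{octet:02x}" for octet in octets)


if __name__ == "__main__":
    print(multicast_mac("ff02::fb"))       # the mapping seen in Figure 9-4
    print(multicast_mac("224.0.0.251"))
    print(multicast_mac("224.128.0.251"))  # a different group, same MAC address
```

Because the high-order bits of the group address are dropped, distinct groups such as 224.0.0.251 and 224.128.0.251 share one hardware address, which is why the IP layer must still check the destination address even when the NIC filters perfectly.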


For NICs that support a multi-entry address table, the destination address on
each received frame is compared against this table, and if the address is found in
the table, the frame is received and processed by the device driver. The entries of
this table are managed by the device driver software in combination with other
layers of the protocol stack (such as the IPv4 and IPv6 implementations). Another
method of implementing this type of filtering is to apply a hash function to the
destination address, forming an index into a (smaller) binary vector. When the
indexed entry in the vector contains a 1 bit, the corresponding address is deemed
to be acceptable and the frame is processed further. This approach can save
memory on the NIC, but because of collisions in the hash function, some frames may be
considered admissible when they should not be. This is not a fatal problem,
however, because higher layers of the stack also perform filtering, and no frames are
ever discarded when they should not have been (i.e., there are no false negatives,
but there may be false positives).
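
The hash-based approach can be sketched as a small bit vector. This is a simplified model rather than the logic of any particular NIC: the class name HashFilter is hypothetical, and zlib.crc32 is used here only as a convenient stand-in for whatever hash function the hardware actually implements.

```python
import zlib


class HashFilter:
    """A binary vector indexed by a hash of the destination MAC address."""

    def __init__(self, bits=4096):
        self.bits = bits
        self.vector = [0] * bits

    def _index(self, mac):
        # Hash the 6-byte address and reduce it to an index into the vector.
        raw = bytes.fromhex(mac.replace(":", ""))
        return zlib.crc32(raw) % self.bits

    def add(self, mac):
        """Record interest in a multicast hardware address."""
        self.vector[self._index(mac)] = 1

    def accepts(self, mac):
        """Return True if a frame addressed to `mac` would pass the filter."""
        return self.vector[self._index(mac)] == 1


if __name__ == "__main__":
    nic = HashFilter()
    nic.add("33:33:00:00:00:fb")
    print(nic.accepts("33:33:00:00:00:fb"))  # always True: no false negatives
```

An address that was added always passes, so there are no false negatives. Any other address that happens to hash to the same index also passes, which is the false positive that higher layers must filter out; shrinking the vector makes such collisions more frequent.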


Note 

The specific capabilities of an NIC vary based on manufacturer. As an example,
the Intel 82583V Ethernet controller includes a 16-entry exact match table
(unicast or multicast), a 4096-bit hash filter for multicast destinations, and support for
both promiscuous reception and promiscuous multicast reception in addition to
filtering based on up to 4096 VLAN tags.


Once the NIC hardware has verified a frame as acceptable (i.e., the CRC is
correct, any VLAN tags match, and the destination MAC address matches an address
entry in one or more of the NIC's tables), the frame is passed to the device driver,
where additional filtering is performed. First, the frame type must specify a
protocol that is supported (e.g., IPv4, IPv6, ARP, etc.). Second, additional multicast
filtering may be performed to check whether the host belongs to the addressed
multicast group (indicated by the destination IP address). This is necessary for
NICs that may generate false positives.

The device driver then passes the frame to the next layer, such as IP, if the
frame type specifies an IP datagram. IP performs more filtering, based on the
source and destination IP addresses, and passes the datagram up to the next layer
(such as TCP or UDP) if all is well. Each time UDP receives a datagram from IP,
it performs filtering based on the destination port number, and sometimes the
source port number, too. If no process is currently using the destination port
number, the datagram is discarded and an ICMPv4 or ICMPv6 Port Unreachable
message is normally generated. (TCP performs similar filtering based on its port
numbers.) If the UDP datagram has a checksum error, UDP silently discards it.

One of the primary motivations behind the development of the multicast
addressing features was to avoid the overhead of broadcasting. Consider an
application that is designed to use UDP broadcasts. If there are 50 hosts on the network
(or VLAN), but only 20 are participating in the application, every time one of the
20 sends a UDP broadcast, the other 30 nonparticipating hosts have to process the
broadcast, all the way up through the UDP layer, before the UDP datagram is
discarded. The UDP datagram is discarded by these 30 hosts because the destination
port number is not in use. The intent of multicasting is to reduce this load on hosts
with no interest in the application. With multicasting, a host specifically joins one
or more multicast groups. If possible, the NIC is told which multicast groups the
host belongs to, and only those multicast frames associated with the IP-layer
multicast groups are allowed through the filter in the NIC. All of this machinery
reduces the overhead imposed on the host, in exchange for additional complexity in
managing multicast addresses and group memberships.


9.4 The Internet Group Management Protocol (IGMP) and 
Multicast Listener Discovery Protocol (MLD) 

So far we have discussed how multicast datagrams are transmitted, filtered,
and received from a host's perspective. When multicast datagrams are to be
forwarded over a wide area network or within an enterprise across multiple
subnets, we require that multicast routing be enabled by one or more multicast routers.
This complicates the situation considerably, because multicast routers require
knowledge about which hosts are interested in what multicast groups, in order
to arrange for multicast traffic to be delivered appropriately. They also execute a
special procedure called the Reverse Path Forwarding (RPF) check. This procedure
performs a routing lookup on the source address of an arriving multicast
datagram. Only if the outgoing interface for routing matches the interface on which
the datagram arrived is the datagram forwarded. The RPF check is important for
avoiding multicast loops. Multicast routing is largely separate from conventional
unicast routing provided by IP routers. However, some capabilities of multicast
routing are required for the IPv6 ND protocol (see Chapter 8) to operate properly.
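
The RPF check described above can be sketched as a longest-prefix lookup on the source address followed by an interface comparison. The routing entries and interface names used here are hypothetical, and the function rpf_check is an illustration of the rule only; real multicast routers may consult a separate multicast routing table for this lookup.

```python
import ipaddress


def rpf_check(source, arrival_iface, unicast_routes):
    """Return True if a multicast datagram from `source` passes the RPF check.

    `unicast_routes` is a list of (prefix, outgoing interface) pairs. The
    datagram passes only if the interface that would be used to route
    toward the source is the interface on which the datagram arrived.
    """
    src = ipaddress.ip_address(source)
    best = None  # (prefix length, interface) of the longest matching prefix
    for prefix, iface in unicast_routes:
        net = ipaddress.ip_network(prefix)
        if src in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, iface)
    return best is not None and best[1] == arrival_iface


if __name__ == "__main__":
    # Hypothetical router with two interfaces.
    routes = [("10.0.0.0/8", "eth0"), ("0.0.0.0/0", "eth1")]
    print(rpf_check("10.1.2.3", "eth0", routes))  # arrived on the expected interface
    print(rpf_check("10.1.2.3", "eth1", routes))  # arrived elsewhere: not forwarded
```

A copy of the datagram that loops back and arrives on any other interface fails the comparison and is dropped, which is how the check helps avoid multicast loops.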

Two major protocols are used to allow multicast routers to learn the groups in
which nearby hosts are interested: the Internet Group Management Protocol (IGMP)
used by IPv4 and the Multicast Listener Discovery (MLD) protocol used by IPv6. Both
are used by hosts and routers that support multicasting, and the protocols are very
similar. These protocols let the multicast routers on a LAN (VLAN) know which
hosts currently belong to which multicast groups. This information is required by
the routers so that they know which multicast datagrams to forward on to which
interfaces. In most cases, a multicast router only requires knowledge that at least one
listening host is reachable by a particular interface, as link-layer multicast
addressing (assuming it is supported) permits the multicast router to send link-layer
multicast frames that will be received by all interested listeners. This allows a multicast
router to do its job without keeping track of every individual host on each interface
that might be interested in multicast traffic for a particular group.

IGMP has evolved over time, and [RFC3376] defines version 3 (the most
current one at the time of writing). MLD has evolved in parallel, and its current
version (2) is defined in [RFC3810]. IGMPv3 and/or MLDv2 are required for
supporting SSM. See [RFC4604] for more details on how these protocols are restricted
when using only a single source per multicast group.

Version 1 of IGMP was the first commonly used version of IGMP. Version
2 added the ability to leave groups more quickly (also supported by MLDv1).
IGMPv3 and MLDv2 add the ability to select the sources of multicast traffic and
are required for deployment of SSM. While IGMP is a separate protocol used with
IPv4, MLD is really part of ICMPv6 (see Chapter 8).

Figure 9-6 indicates how IGMP (MLD) is used by an IPv4 (IPv6)
multicast-enabled router. Such routers are interested in ascertaining which multicast groups
are of interest on each of their attached interfaces. These routers require this
information in order to avoid simply broadcasting all traffic out of every interface.

In Figure 9-6, we can see how IGMP (MLD) queries are sent by multicast
routers. These are sent to the All Hosts multicast address, 224.0.0.1 (IGMP), or the All
Nodes link-scope multicast address, ff02::1 (MLD), and processed by every host
implementing IP multicast (see the exception in Section 9.4.2 for "specific"
queries). Membership report messages are sent by group members (hosts) in response
to the queries but may also be sent in an unsolicited way from hosts that wish
to inform multicast routers that their group membership(s) and/or interest in
particular sources has changed. IGMPv3 reports are sent to the IGMPv3-capable
multicast router address 224.0.0.22. MLDv2 reports are sent to the
corresponding MLDv2 Listeners IPv6 multicast address ff02::16. Note that multicast routers
themselves may also act as members when they join multicast groups.

Figure 9-6 Multicast routers send IGMP (MLD) requests to each attached subnet periodically to
determine which groups and sources are of interest to the attached hosts. Hosts respond
with reports indicating which groups and sources are of interest. Hosts may also send
unsolicited reports if membership changes occur.


Note 

In IGMPv1 and IGMPv2, after receiving a query, hosts do not respond
immediately but instead may wait a small random amount of time to see if any other host
responds for the same group. If so, a host's response is suppressed (not sent).
This is accomplished by having reports sent to the multicast address of the group
in question. Appendix A of [RFC3376] indicates why this operation was removed
in IGMPv3. In short, multicast routers may wish to track individual hosts'
subscriptions, suppression does not work well in bridged LANs using IGMP snooping (see
Section 9.4.7), handling suppression complicates the protocol implementation,
and IGMPv3 reports contain information on multiple groups, making successful
suppression less likely. Note that both IGMPv3 and MLDv2 require backward
compatibility with earlier versions of themselves and revert to using older-version
protocol messages if older hosts or routers are detected on the same subnet.


The encapsulations for IGMP and MLD are shown in Figure 9-7. Like ICMP,
IGMP is considered part of the IP layer. Also like ICMP, IGMP messages are
transmitted in IPv4 datagrams. Unlike other protocols that we have seen, IGMP uses
a fixed TTL of 1, so packets are limited to the local subnetwork. IGMP packets
also use the IPv4 Router Alert option and use the 6-bit value 0x30 in the DS Field
to represent Internetwork Control (CS6, see Chapter 5). In IPv6, MLD is part of
ICMPv6, but the functionality of MLD is nearly identical to that of IGMP, so we
describe it here (we described its message formats briefly when describing ICMPv6
in Chapter 8). Its encapsulation makes use of an IPv6 Hop-by-Hop extension header
to hold the Router Alert option. In many cases, the list of sources is empty.

[Figure: IGMP and MLD encapsulation. IGMP is carried in an IPv4 datagram with
the Router Alert option. Its destination is 224.0.0.1 (All Hosts, general queries),
the group address (specific queries and IGMPv2), or 224.0.0.22 (All-IGMPv3-routers,
reports). Its Type is Query (17), v3 Report (34), v1 Report (18), v2 Report (22),
or v2 Leave (23). MLD is carried in an IPv6 datagram whose 40-byte header uses a
link-local source address or ::, followed by a Hop-by-Hop extension header holding
the Router Alert option (T: 5, L: 2) and then the MLD data with its optional source
list. Its Type is Query (130) or v2 Report (143).]

Figure 9-7 IGMP is encapsulated as a separate protocol in IPv4. MLD is a type of ICMPv6 message.

IGMP and MLD define two sets of protocol processing rules: those performed
by hosts that are group members and those performed by multicast routers.
Generally speaking, the job of the member hosts (which we will call "group members")
is to spontaneously report changes in interest in multicast groups and sources
and to respond to periodic queries. Multicast routers send queries to ascertain
whether any interest is present on an attached link for any groups, or for a specific
multicast group and source. Routers also interact with wide area multicast
protocols (such as PIM-SM and BIDIR-PIM) to bring the desired traffic to the interested
hosts or prohibit traffic from flowing to uninterested hosts. For more details on
these protocols, please see [RFC4601] and [RFC5015].

9.4.1 IGMP and MLD Processing by Group Members (“Group Member Part”) 

The group members' portion of IGMP and MLD is designed to allow hosts to
specify what groups they are interested in and whether traffic sent from
particular sources should be accepted or filtered out. This is accomplished by sending
reports to one or more multicast routers (and participating hosts) attached to the
same subnet. Reports may be sent as a result of receiving a query, or
spontaneously (unsolicited) because of a local change in reception state (e.g., an application
joins or leaves a group). IGMP reports take the form shown in Figure 9-8.

 0                            15 16                           31
+---------------+---------------+-------------------------------+
| Type (0x22)   | Reserved (0)  | Checksum (16 bits)            |  Basic IGMP
| (8 bits)      | (8 bits)      |                               |  Report Header
+---------------+---------------+-------------------------------+  (8 bytes)
| Reserved (16 bits)            | Number of Group Records (N)   |
|                               | (16 bits)                     |
+-------------------------------+-------------------------------+
| Group Record [1]                                              |
+---------------------------------------------------------------+
| Group Record [2]                                              |
+---------------------------------------------------------------+
| ...                                                           |
+---------------------------------------------------------------+
| Group Record [N]                                              |
+---------------------------------------------------------------+

Figure 9-8 The IGMPv3 membership report contains group records for N groups. Each group
record indicates a multicast address and optional list of sources.


Report messages are fairly simple. They contain a vector of group records, each 
of which provides information about a particular multicast group, including the 
address of the subject group, and an optional list of sources used for establishing 
filters (see Figure 9-9). 

Each group record contains a type, the address of the subject group, and a list
of source addresses to either include or exclude. There is also support for
including auxiliary data, but this feature is not used by IGMPv3. Table 9-1 reveals the
significant flexibility that can be achieved using IGMPv3 report record types. MLD
uses the same values. A list of sources is said to refer to include mode or exclude
mode. In include mode, the sources in the list are the only sources from which
traffic should be accepted. In exclude mode, the sources in the list are the ones to
be filtered out (all others are allowed). Leaving a group can be expressed as using
an include mode filter with no sources, and a simple join of a group (i.e., for any
source) can be expressed as using the exclude mode filter with no sources. Note
that when using SSM, types 0x02 and 0x04 are not used, as only a single source is
assumed for any group.

 0                            15 16                           31
+---------------+---------------+-------------------------------+
| Record Type   | Aux Data Len  | Number of Group Sources (N)   |
| (8 bits)      | (8 bits)      | (16 bits)                     |
+---------------+---------------+-------------------------------+
| IPv4 Multicast (Group) Address (32 bits)                      |
+---------------------------------------------------------------+
| Source Address [1]                                            |
+---------------------------------------------------------------+
| Source Address [2]                                            |
+---------------------------------------------------------------+
| ...                                                           |
+---------------------------------------------------------------+
| Source Address [N]                                            |
+---------------------------------------------------------------+
| Auxiliary Data                                                |
+---------------------------------------------------------------+

Figure 9-9 An IGMPv3 group record includes a multicast address (group) and an optional list of
sources. Groups of sources are either allowed as senders (include mode) or filtered out
(exclude mode). Previous versions of IGMP reports did not include a list of sources.
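
The report and group record layouts in Figures 9-8 and 9-9 can be exercised with a short sketch that packs the fields in network byte order. It builds only the IGMP message body (not the enclosing IPv4 datagram or the Router Alert option), never sets auxiliary data, and uses function names (internet_checksum, group_record, igmpv3_report) that are choices made for this example rather than any library's API.

```python
import socket
import struct


def internet_checksum(data):
    """Compute the 16-bit one's-complement Internet checksum over `data`."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def group_record(record_type, group, sources=()):
    """Pack one group record: type, aux data length (0), source count, group, sources."""
    header = struct.pack("!BBH", record_type, 0, len(sources))
    addresses = socket.inet_aton(group) + b"".join(socket.inet_aton(s) for s in sources)
    return header + addresses


def igmpv3_report(records):
    """Pack an IGMPv3 membership report (type 0x22) around the given records."""
    # Type, reserved, checksum placeholder, reserved, number of group records.
    header = struct.pack("!BBHHH", 0x22, 0, 0, 0, len(records))
    message = header + b"".join(records)
    checksum = internet_checksum(message)
    return message[:2] + struct.pack("!H", checksum) + message[4:]


if __name__ == "__main__":
    # Exclude mode with no sources: a simple join of the group for any source.
    report = igmpv3_report([group_record(0x02, "224.0.0.251")])
    print(report.hex())
```

A receiver can verify the message by running the same checksum routine over the whole report, which yields zero when the message is intact.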


Table 9-1 Type values for IGMP and MLD source lists indicate the filtering mode (include or exclude) and
whether the source list has changed

Type  Name and Meaning                                       When Sent

0x01  MODE_IS_INCLUDE (IS_IN): traffic sent from any of      In response to a query from a
      the associated source addresses is not to be           multicast router
      filtered.

0x02  MODE_IS_EXCLUDE (IS_EX): traffic sent from any of      In response to a query from a
      the associated source addresses should be filtered.    multicast router

0x03  CHANGE_TO_INCLUDE_MODE (TO_IN): a change from          In response to a local action
      exclude mode; traffic sent from any of the             changing the filter mode from
      associated source addresses should now not be          exclude to include
      filtered.

0x04  CHANGE_TO_EXCLUDE_MODE (TO_EX): a change from          In response to a local action
      include mode; traffic sent from any of the             changing the filter mode from
      associated source addresses should now be filtered.    include to exclude

0x05  ALLOW_NEW_SOURCES (ALLOW): a change in source list;    In response to a local action
      traffic sent from any of the associated source         changing the source list to
      addresses should now not be filtered.                  allow new sources

0x06  BLOCK_OLD_SOURCES (BLOCK): a change in source list;    In response to a local action
      traffic sent from any of the associated source         changing the source list to
      addresses should now be filtered.                      disallow previously allowed
                                                             sources

The first two message types (0x01, 0x02) are known as current-state records and
are used to report the current filter state in response to a query. The next two
(0x03, 0x04) are known as filter-mode-change records, which indicate a change from
include to exclude mode or vice versa. The last two (0x05, 0x06) are known as
source-list-change records and indicate a change to the sources being handled in
either exclude or include mode. The last four types are also described more
generally as state-change records or state-change reports. These are sent as a result of some
local state change such as a new application being started or stopped, or a running
application changing its group/source interests. Note that IGMP and MLD
queries/reports themselves are never filtered. MLD reports use a structure similar to
IGMP reports but accommodate larger addresses and use an ICMPv6 type code of
143 (see Chapter 8).
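
The record classes and the include and exclude filter modes discussed in this section can be summarized in a few lines of code. The dictionary RECORD_CLASS and the function accepts are names invented for this sketch; the type values are those listed in Table 9-1.

```python
# Record type values from Table 9-1, grouped by the class of record they form.
RECORD_CLASS = {
    0x01: "current-state",       # MODE_IS_INCLUDE
    0x02: "current-state",       # MODE_IS_EXCLUDE
    0x03: "filter-mode-change",  # CHANGE_TO_INCLUDE_MODE
    0x04: "filter-mode-change",  # CHANGE_TO_EXCLUDE_MODE
    0x05: "source-list-change",  # ALLOW_NEW_SOURCES
    0x06: "source-list-change",  # BLOCK_OLD_SOURCES
}


def is_state_change(record_type):
    """The last four types are the state-change records."""
    return RECORD_CLASS[record_type] != "current-state"


def accepts(mode, source_list, sender):
    """Decide whether traffic from `sender` passes a group's source filter."""
    if mode == "include":
        return sender in source_list      # only listed sources are accepted
    if mode == "exclude":
        return sender not in source_list  # listed sources are filtered out
    raise ValueError(f"unknown filter mode: {mode}")


if __name__ == "__main__":
    print(accepts("exclude", set(), "10.0.0.2"))  # join for any source
    print(accepts("include", set(), "10.0.0.2"))  # equivalent to leaving the group
```

The two calls at the end show why an empty exclude list behaves as a simple join and an empty include list behaves as a leave.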

When receiving a query, group members do not respond immediately. Instead,
they set a random (bounded) timer to determine when to respond. During this
delay interval, processes may alter their group/source interests. Any such
modifications can be processed together before a timer expires to trigger the report. In
this way, once the timer does expire, the status of multiple groups can more likely
be merged into a single report, saving overhead.

The source address used for IGMP is the primary or preferred IPv4 address
of the sending interface. For MLD, the source address is a link-local IPv6 address.
One complication arises when a host is booting and attempting to determine its
own IPv6 address. During this time, it selects a potential IPv6 address to use and
executes the duplicate address detection (DAD) procedure (see Chapter 6) to
determine if any other systems are already using this address. Because DAD involves
multicast, some source address must be assigned to outgoing MLD messages. This
is addressed by [RFC3590], which allows the unspecified address (::) to be used as
the source IPv6 address for MLD traffic during configuration.

9.4.2 IGMP and MLD Processing by Multicast Routers (“Multicast Router Part”) 

In IGMP and MLD, the job of the multicast router is to determine, for each
multicast group, interface, and source list, whether at least one group member is
present to receive corresponding traffic. This is accomplished by sending queries and
building state describing the existence of such members based on the reports they
send. This state is soft state, meaning that it is cleared after a certain amount of
time if not refreshed. To build this state, multicast routers send IGMPv3 queries of
the form depicted in Figure 9-10.
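
The idea of soft state can be illustrated with a small table whose entries expire unless a report refreshes them. This is a simplified model under stated assumptions: the class name GroupState is hypothetical, source lists are ignored, the lifetime in the usage example is an arbitrary illustrative value, and time is passed in explicitly so that the behavior is easy to follow.

```python
class GroupState:
    """Per-interface, per-group listener state that times out if not refreshed."""

    def __init__(self, lifetime):
        self.lifetime = lifetime  # seconds an entry survives without a report
        self.expires = {}         # (interface, group) -> expiry time

    def report_received(self, iface, group, now):
        """A membership report creates the entry or pushes its expiry forward."""
        self.expires[(iface, group)] = now + self.lifetime

    def has_listener(self, iface, group, now):
        """Return True if at least one listener is still believed to be present."""
        deadline = self.expires.get((iface, group))
        if deadline is None:
            return False
        if now >= deadline:
            # Not refreshed in time: the state is cleared.
            del self.expires[(iface, group)]
            return False
        return True


if __name__ == "__main__":
    state = GroupState(lifetime=260)  # illustrative value only
    state.report_received("eth0", "224.0.0.251", now=0)
    print(state.has_listener("eth0", "224.0.0.251", now=100))  # still present
    print(state.has_listener("eth0", "224.0.0.251", now=300))  # expired
```

Note that the router records only whether some listener exists for a group on an interface, not which hosts they are, which matches the "at least one group member" requirement stated above.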

The IGMP query message is very similar to the ICMPv6 MLD query we discussed
in Chapter 8. In this case, the group (multicast) address is 32 bits in length
and the Max Resp Code field is 8 bits instead of 16. The Max Resp Code field encodes
the maximum amount of time the receiver of the query should delay before sending
a report, encoded in 100ms units for values below 128. For values above 127, the
field is encoded as shown in Figure 9-11.

Figure 9-10 The IGMPv3 query includes the multicast group address and optional list of sources.
General queries use a group address of 0 and are sent to the All Hosts multicast address,
224.0.0.1. The QRV value encodes the maximum number of retransmissions the sender
will use, and the QQIC field encodes the periodic query interval. Specific queries are
used before terminating traffic flow for a group or source/group combination. In this
case (and all cases with IGMPv2 or IGMPv1), the query is sent to the address of the
subject group.


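[Figure 9-11: layout of the 8-bit Max Resp Code field (bits 0 through 7) for values above
127. Bit 0 is set to 1 and is followed by a 3-bit Exponent and a 4-bit Mantissa.]

Max Resp Time = (mantissa + 16) * 2^(exponent + 3)

Figure 9-11 The Max Resp Code field encodes the maximum time to delay responses in 100ms units.
For values above 127, an exponential value can be used to accommodate larger values.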


This encoding provides for a possible range of (16)(8) = 128 to (31)(1024) =
31,744 (i.e., about 13s to 53 minutes). Using smaller values for the Max Resp Code
field allows for tuning the leave latency (the elapsed time from when the last group
member leaves to the time corresponding traffic ceases to be forwarded). Larger
values of this field reduce the traffic load of the IGMP messages generated by
members by increasing the likelihood of longer periods for reporting.
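
The encoding just described can be sketched in a few lines of Python. This is only an
illustration: the function names are hypothetical, and the bit positions assume the
layout of Figure 9-11 (a leading 1 bit, then a 3-bit exponent, then a 4-bit mantissa).

```python
def decode_max_resp_code(code: int) -> int:
    """Return the Max Resp Time, in 100ms units, for an 8-bit Max Resp Code."""
    if not 0 <= code <= 0xFF:
        raise ValueError("Max Resp Code is an 8-bit field")
    if code < 128:
        return code                      # values below 128 are used literally
    exponent = (code >> 4) & 0x07        # 3 bits following the leading 1 bit
    mantissa = code & 0x0F               # low-order 4 bits
    return (mantissa + 16) << (exponent + 3)


def max_resp_seconds(code: int) -> float:
    """Convert a Max Resp Code to seconds (the field counts 100ms units)."""
    return decode_max_resp_code(code) / 10.0
```

For instance, the code 0x64 seen later in the IGMPv3 query of Figure 9-16 decodes to 100
units, or 10s, and the extremes of the exponential form (0x80 and 0xFF) decode to 128 and
31,744 units.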

The remaining fields in a query include an Internet-style checksum across the
whole message, the address of the subject group, a list of sources, and the S, QRV,
and QQIC fields we defined in Chapter 8 with MLD. In cases where the multicast
router wishes to know about interest in all multicast groups, the Group Address
field is set to 0 (such queries are called "general queries"). The S and QRV fields
are used for fault tolerance and retransmission of reports and are discussed in
Section 9.4.5. The QQIC field is the Querier's Query Interval Code. This value is the
query sending period, in units of seconds and encoded using the same method as
the Max Resp Code field (i.e., a range from 0 to 31,744).

There are three variants of the query message that can be sent by a multicast
router: general query, group-specific query, and group-and-source-specific query. The
first form is used by the multicast router to update information regarding any
multicast group, and for such queries the group list is empty. Group-specific queries
are similar to general queries but are specific to the identified group. The last
type is essentially a group-specific query with a set of sources included. The specific
queries are sent to the destination IP address of the subject group, as opposed
to general queries that are sent to the All Systems multicast address (for IPv4) or
the link-scope All Nodes multicast address for IPv6 (ff02::1).
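
As a rough sketch of how a receiver might tell the three variants apart, the following
Python parses the fixed 12-byte part of an IGMPv3 query plus any source addresses. The
function names and the returned dictionary keys are illustrative assumptions, not part
of any real library.

```python
import ipaddress
import struct


def internet_checksum(data: bytes) -> int:
    """Internet-style checksum: one's complement of the one's complement sum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def parse_igmpv3_query(msg: bytes) -> dict:
    """Parse an IGMPv3 membership query and classify its variant."""
    if len(msg) < 12:
        raise ValueError("an IGMPv3 query is at least 12 bytes long")
    msg_type, max_resp_code, _cksum, group, s_qrv, qqic, nsrc = struct.unpack(
        "!BBH4sBBH", msg[:12])
    if msg_type != 0x11:
        raise ValueError("not a membership query")
    if len(msg) < 12 + 4 * nsrc:
        raise ValueError("source list is truncated")
    sources = [str(ipaddress.IPv4Address(msg[12 + 4 * i:16 + 4 * i]))
               for i in range(nsrc)]
    group_addr = ipaddress.IPv4Address(group)
    if int(group_addr) == 0:
        kind = "general"
    elif nsrc == 0:
        kind = "group-specific"
    else:
        kind = "group-and-source-specific"
    return {
        "kind": kind,
        "group": str(group_addr),
        "max_resp_code": max_resp_code,
        "s_flag": bool(s_qrv & 0x08),
        "qrv": s_qrv & 0x07,
        "qqic": qqic,
        "sources": sources,
        # A message with a valid checksum sums to zero when re-checked.
        "checksum_ok": internet_checksum(msg) == 0,
    }
```

Feeding it the 12 bytes 11 64 ec 1e 00 00 00 00 02 7d 00 00 (reassembled here from the
fields displayed for packet 8 in Figure 9-16, assuming the reserved bits are zero) yields
a general query with QRV = 2 and QQIC = 125.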

The specific queries are sent in response to state-change reports in order to
verify that it is appropriate for the router to take some action (e.g., to ensure that no
interest remains in a particular group before constructing a filter). When receiving
either filter-mode-change records or source-list-change records, the multicast
router arranges to add new traffic sources and may be able to filter out traffic from
certain sources. In cases where the multicast router is prepared to begin filtering
out traffic that was flowing previously, it uses the group-specific query and
group-and-source-specific query first. If these queries elicit no reports, the router
is free to begin filtering out the corresponding traffic. Because such changes can
significantly affect the flow of multicast traffic, state-change reports and specific
queries are retransmitted (see Section 9.4.5).

9.4.3 Examples 

Figure 9-12 shows a packet trace containing a combination of IGMPv2, IGMPv3,
MLDv1, and MLDv2 protocols, all working on the same subnet. The trace is 16
packets in length (the first 10 are shown in Figure 9-12) and begins with an MLD
query from fe80::204:5aff:fe9f:9e80, the link-local IPv6 address of the querier.
Recall that MLD and MLDv2 use the same query format. This same system also
acts as an IGMP querier using the IPv4 source address 10.0.0.1.

[Figure 9-12: Wireshark capture. Packet list (rows 8 through 10 shown) and decode of packet 1.]

No.  Time      Source                     Destination  Protocol  Info
8    6.108808  10.0.0.1                   224.0.0.1    IGMP      V3 Membership Query, general
9    7.163508  fe80::fd26:de93:5ab7:405a  ff02::16     ICMPv6    Multicast Listener Report Message v2
10   7.653561  10.0.0.14                  224.0.1.60   IGMP      V2 Membership Report / Join group 224.0.1.60

Frame 1: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
Ethernet II, Src: 00:04:5a:9f:9e:80, Dst: 33:33:00:00:00:01
Internet Protocol Version 6, Src: fe80::204:5aff:fe9f:9e80, Dst: ff02::1
    Version: 6
    Traffic class: 0x00000000
    Flowlabel: 0x00000000
    Payload length: 36
    Next header: IPv6 hop-by-hop option (0x00)
    Hop limit: 1
    Source: fe80::204:5aff:fe9f:9e80
    [Source SA MAC: 00:04:5a:9f:9e:80]
    Destination: ff02::1
    Hop-by-Hop Option
        Next header: ICMPv6 (0x3a)
        Length: 0 (8 bytes)
        Router alert: MLD (4 bytes)
        PadN: 2 bytes
Internet Control Message Protocol v6
    Type: 130 (Multicast listener query)
    Code: 0 (unknown)
    Checksum: 0x5c73 [correct]
    Maximum response delay[ms]: 10000
    Multicast Address: ::
    S Flag: off
    Robustness: 2
    QQI: 125

Figure 9-12 IGMPv2, IGMPv3, MLDv1, and MLDv2, all working on the same subnet. The highlighted packet
is an MLD query.

In Figure 9-12, the MLD query (packet 1) is sent by the querier using its link-local
IPv6 address fe80::204:5aff:fe9f:9e80 to the multicast address ff02::1 (All
Nodes). The MAC-layer addresses are 00:04:5a:9f:9e:80 and 33:33:00:00:00:01,
respectively. Here we can see how an IPv6 link-local unicast address relates to the
corresponding MAC address, and also how the All Nodes address is mapped to
the MAC address using prefix 33:33, as we discussed earlier. The IPv6 Hop Limit
field is set to 1, as MLD messages are applicable only to the local link. The IPv6
Payload Length field indicates 36 bytes, which includes 8 bytes holding the MLD
form of Router Alert (a Hop-by-Hop option), 4 bytes of ICMPv6 header information,
and 24 bytes to hold the MLD data itself. The Type, Code, Checksum, and Max
Response fields of the MLD message, plus 2 reserved bytes, together occupy 8 bytes
(the 4-byte ICMPv6 header and the first 4 of the 24 data bytes); 16 more
are used to hold the Multicast Address field (set to 0, the unspecified
address, to refer to all groups). The S bit field, QRV, and QQIC fields together use
2 more bytes, and the last 2 hold the number of sources identified, which in this
case is 0. In this example, we see default values for all MLD information: 10s for
the maximum response delay, QRV = 2, and 125s for the query interval. The next
message (packet 2, Figure 9-13) is the response for the query.
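
The two address relationships visible in packet 1 can be reproduced with a short Python
sketch. The helper names are hypothetical. The link-local derivation assumes the
interface identifier is formed from the MAC address by the modified EUI-64 method,
which holds for the querier here but not for every host (for example,
fe80::fd26:de93:5ab7:405a apparently uses a different kind of identifier).

```python
import ipaddress


def mac_to_bytes(mac: str) -> bytes:
    """Convert a colon-separated MAC address string to 6 raw bytes."""
    raw = bytes(int(part, 16) for part in mac.split(":"))
    if len(raw) != 6:
        raise ValueError("expected a 48-bit MAC address")
    return raw


def format_mac(raw: bytes) -> str:
    """Format raw bytes as a lowercase colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def link_local_from_mac(mac: str) -> ipaddress.IPv6Address:
    """fe80::/64 address with a modified EUI-64 interface identifier."""
    raw = mac_to_bytes(mac)
    # Flip the universal/local bit and insert ff:fe in the middle.
    iid = bytes([raw[0] ^ 0x02]) + raw[1:3] + b"\xff\xfe" + raw[3:6]
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + iid)


def ipv6_multicast_mac(group: str) -> str:
    """Map an IPv6 multicast address to 33:33 plus its low-order 32 bits."""
    addr = ipaddress.IPv6Address(group)
    if not addr.is_multicast:
        raise ValueError("not an IPv6 multicast address")
    return format_mac(b"\x33\x33" + addr.packed[-4:])
```

With these helpers, 00:04:5a:9f:9e:80 maps to fe80::204:5aff:fe9f:9e80, and ff02::1 maps
to 33:33:00:00:00:01, matching the addresses in the trace.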

Figure 9-13 is an MLDv2 report indicating interest in the multicast address
ff02::c (the link-local multicast address for SSDP). Interest is indicated in such
reports using an exclude mode report containing an empty source list. The next
few packets of the trace show the use of MLDv1 (still used by some systems).

[Figure 9-13: Wireshark decode of packet 2.]

Frame 2: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
Ethernet II, Src: 00:13:02:20:b9:18, Dst: 33:33:00:00:00:16
Internet Protocol Version 6, Src: fe80::fd26:de93:5ab7:405a, Dst: ff02::16
Internet Control Message Protocol v6
    Type: 143 (Multicast Listener Report Message v2)
    Code: 0 (Should always be zero)
    Checksum: 0xfb32 [correct]
    Reserved: 0 (Should always be zero)
    Number of records: 1
    Exclude: ff02::c
        Mode: Exclude (2)
        Aux data len: 0
        Number of Sources: 0
        Multicast Address: ff02::c

Figure 9-13 An MLDv2 listener report message expresses interest in the group ff02::c (the link-local scope
multicast address for SSDP) by using an exclude-type message with no sources.

[Figure 9-14: Wireshark decode of packet 3.]

Frame 3: 86 bytes on wire (688 bits), 86 bytes captured (688 bits)
Ethernet II, Src: 00:17:f2:e7:6d:91, Dst: 33:33:74:08:ff:56
Internet Protocol Version 6, Src: fe80::217:f2ff:fee7:6d91, Dst: ff02::2:7408:ff56
Internet Control Message Protocol v6
    Type: 131 (Multicast listener report)
    Code: 0 (Should always be zero)
    Checksum: 0x37d3 [correct]
    Maximum response delay: 0
    Multicast Address: ff02::2:7408:ff56

Figure 9-14 The MLDv1 report message expresses an interest in the multicast address ff02::2:7408:ff56, which
is also the destination IPv6 address.

Packets 3 through 5 in Figure 9-14 are all MLDv1 reports. Only packet 3 is
shown here, as the others are similar (they differ only in their respective destination
IPv6 addresses). As with MLDv2, each report uses the same structure for the
IPv6 base and extension headers, but the destination address of the report is the
multicast address of interest, ff02::2:7408:ff56. Note that at the MAC layer, this
destination address is mapped to 33:33:74:08:ff:56. The next portion of the trace,
starting with packet 6 in Figure 9-15, shows how MLDv2 can report multiple interests.

Figure 9-15 This MLDv2 report expresses interest in five multicast groups. Each multicast address
record reports interest in a single group by indicating that no sources are to be excluded
(i.e., mode is exclude with no associated sources).


Packet 6 in Figure 9-15 is the first MLDv2 report indicating interest in more
than one multicast address. In this case, it is from fe80::204:5aff:fe9f:9e80 (the
MLD querier) and contains information for five groups: ff02::16 (all MLDv2-capable
routers), ff02::1:ff00:0 (first solicited-node address), ff02::2 (All Routers),
ff02::202 (ONC RPC, a form of remote procedure call), and ff02::1:ff9f:9e80 (its own
solicited-node group). Packet 7 (not detailed) is an MLDv2 report indicating that
host fe80::fd26:de93:5ab7:405a has interest in address ff02::1:ffb7:405a, its
solicited-node address. We now move on to the non-IPv6 traffic in the trace as shown in
Figure 9-16.
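
The solicited-node groups reported in packets 6 and 7 follow directly from the hosts'
unicast addresses: the low-order 24 bits of the unicast address are appended to the
prefix ff02::1:ff00:0/104. A minimal Python sketch (the function name is an illustrative
choice) is:

```python
import ipaddress


def solicited_node(unicast: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast address for an IPv6 unicast address."""
    low24 = ipaddress.IPv6Address(unicast).packed[13:]           # last 3 bytes
    prefix = ipaddress.IPv6Address("ff02::1:ff00:0").packed[:13]  # first 13 bytes
    return ipaddress.IPv6Address(prefix + low24)
```

Applied to fe80::204:5aff:fe9f:9e80 this gives ff02::1:ff9f:9e80, and applied to
fe80::fd26:de93:5ab7:405a it gives ff02::1:ffb7:405a, as seen in the trace.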

[Figure 9-16: Wireshark capture. Packet list (rows 7 through 9 shown) and decode of packet 8.]

No.  Time      Source                     Destination  Protocol  Info
7    5.163634  fe80::fd26:de93:5ab7:405a  ff02::16     ICMPv6    Multicast Listener Report Message v2
8    6.108808  10.0.0.1                   224.0.0.1    IGMP      V3 Membership Query, general
9    7.163508  fe80::fd26:de93:5ab7:405a  ff02::16     ICMPv6    Multicast Listener Report Message v2

Frame 8: 50 bytes on wire (400 bits), 50 bytes captured (400 bits)
Ethernet II, Src: 00:04:5a:9f:9e:80, Dst: 01:00:5e:00:00:01
Internet Protocol, Src: 10.0.0.1, Dst: 224.0.0.1
    Version: 4
    Header length: 24 bytes
    Differentiated Services Field: 0xc0 (DSCP 0x30: Class Selector 6; ECN: 0x00)
    Total Length: 36
    Identification: 0x3d65 (15717)
    Flags: 0x00
    Fragment offset: 0
    Time to live: 1
    Protocol: IGMP (2)
    Header checksum: 0xfcac [correct]
    Source: 10.0.0.1
    Destination: 224.0.0.1
    Options: (4 bytes)
        Router Alert: Every router examines packet
Internet Group Management Protocol
    [IGMP version: 3]
    Type: Membership Query (0x11)
    Max Response Time: 10.0 sec (0x64)
    Header checksum: 0xec1e [correct]
    Multicast Address: 0.0.0.0
    QRV=2 S=Do not suppress router side processing
        .... 0... = S: Do not suppress router side processing
        .... .010 = QRV: 2
    QQIC: 125
    Num Src: 0

Figure 9-16 An IGMPv3 general membership query is sent to the All Nodes multicast address,
224.0.0.1. Its IPv4 header contains a DSCP value of 0x30 (class selector 6) and the IPv4
Router Alert option.

Packet 8 in Figure 9-16 is the first IPv4 packet of the trace, and it is an IGMPv3
general query from the querier 10.0.0.1. The packet is sent to the All Nodes
address, 224.0.0.1, and this multicast address is mapped to the link-layer address
01:00:5e:00:00:01. The TTL is set to 1, as IGMP messages are not forwarded
through routers. The IPv4 header is 24 bytes, which is 4 bytes larger than a basic
IPv4 header in order to hold the 4-byte Router Alert option. This particular packet
is an IGMPv3 membership query, with the default maximum response time of 10s
and query interval of 125s. The multicast address (group) identified is 0.0.0.0, so
this is a general query requesting knowledge about all multicast groups in use.
Packet 9 (not detailed but similar to packets 7 and 2) is an interspersed MLDv2
response, indicating interest in the multicast address ff02::1:3 (LLMNR). The last
seven packets are shown in Figure 9-17.
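
The IPv4 counterpart of the multicast address mapping can be sketched as follows. The
function name is hypothetical; the mapping itself places the low-order 23 bits of the
group address after the fixed prefix 01:00:5e, so 32 different groups share each MAC
address.

```python
import ipaddress


def ipv4_multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError("not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF                     # keep the low-order 23 bits
    raw = bytes([0x01, 0x00, 0x5E]) + low23.to_bytes(3, "big")
    return ":".join(f"{b:02x}" for b in raw)
```

For the trace above, 224.0.0.1 maps to 01:00:5e:00:00:01 and 224.0.1.60 maps to
01:00:5e:00:01:3c. Because only 23 bits are carried over, an address such as 224.128.0.1
maps to the same MAC address as 224.0.0.1.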

Packet 10 in Figure 9-17 is an IGMPv2 membership report sent from 10.0.0.14
(a network-attached printer) to 224.0.1.60, which is a discovery service used for
equipment manufactured by Hewlett-Packard. As with MLDv1, IGMPv2 messages
are sent to the IP address of the group being referenced. Such messages have
TTL = 1, include the Router Alert option, and are 32 bytes in length (24 bytes of
IPv4 header plus 8 bytes of IGMP report information).

[Figure 9-17: Wireshark capture. Packet list (rows 11 through 16 shown) and decode of packet 10.]

No.  Time       Source                    Destination      Protocol  Info
11   9.436635   10.0.0.14                 239.255.255.250  IGMP      V2 Membership Report / Join group 239.255.255.250
12   9.518934   fe80::208:74ff:fe93:c83c  ff02:[...]       ICMPv6    Multicast Listener Report Message v2
13   10.163390  10.0.0.57                 224.0.0.22       IGMP      V3 Membership Report / Join group 239.255.255.250 for any sources
14   12.163300  10.0.0.57                 224.0.0.22       IGMP      V3 Membership Report / Join group 224.0.0.252 for any sources
15   12.640236  10.0.0.13                 224.0.0.22       IGMP      V3 Membership Report / Join group 224.0.0.251 for any sources
16   12.753031  10.0.0.14                 224.0.0.251      IGMP      V2 Membership Report / Join group 224.0.0.251

Frame 10: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: 00:1e:0b:df:9e:b3, Dst: 01:00:5e:00:01:3c
Internet Protocol, Src: 10.0.0.14, Dst: 224.0.1.60
    Version: 4
    Header length: 24 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
    Total Length: 32
    Identification: 0xcb7f (52095)
    Flags: 0x00
    Fragment offset: 0
    Time to live: 1
    Protocol: IGMP (2)
    Header checksum: 0x6e0e [correct]
    Source: 10.0.0.14
    Destination: 224.0.1.60
    Options: (4 bytes)
        Router Alert: Every router examines packet
Internet Group Management Protocol
    [IGMP version: 2]
    Type: Membership Report (0x16)
    Max Response Time: 0.0 sec (0x00)
    Header checksum: 0x08c3 [correct]
    Multicast Address: 224.0.1.60

Figure 9-17 Packet 10 is detailed along with the last seven packets, which are a mix of IGMPv2 and IGMPv3
membership reports (except packet 12). IGMPv2 reports do not contain source-specific
information.


The remaining packets are not detailed as they are similar to other packets we
have already seen in detail. Packet 11 reports that the same system, 10.0.0.14, wishes
to join the group 239.255.255.250 (part of UPnP). Packet 12 is an MLDv2 report
indicating that the host fe80::208:74ff:fe93:c83c is interested in the multicast addresses
ff02::202 (ONC RPC) and ff02::1:ff93:c83c (its solicited-node address). Packets 13
and 14 are IGMPv3 reports indicating that the host with IPv4 address 10.0.0.57 has
interest in groups 239.255.255.250 and 224.0.0.252 (LLMNR), respectively. The last
two packets indicate that hosts 10.0.0.13 and 10.0.0.14 wish to join group 224.0.0.251
(mDNS; see Chapter 11). They are IGMPv3 and IGMPv2 reports, respectively.

9.4.4 Lightweight IGMPv3 and MLDv2

As we have seen, hosts maintain filter state about what multicast groups their
applications and system software are interested in. With IGMPv3 or MLDv2 they
also maintain a list of sources that are excluded or included. Multicast routers
maintain similar state in order to know what traffic needs to be forwarded on to
a link for receipt by interested hosts. The reverse is also true: a multicast router
can forgo forwarding multicast traffic sent from a host that is in every receiver's
exclude list. Practical experience has shown, however, that applications rarely need
to block specific sources, and support for this function is somewhat complicated.
However, hosts often wish to include a specific source associated with a group,
especially when SSM is in use. As a consequence, simplified versions of IGMPv3
and MLDv2, called Lightweight IGMPv3 (LW-IGMPv3) and Lightweight MLDv2
(LW-MLDv2), respectively, have been defined in [RFC5790].

LW-IGMPv3 and LW-MLDv2 are subsets of their progenitors. They support 
both ASM and SSM and use a message format compatible with IGMPv3 and 
MLDv2, but they lack the specific source-blocking function. Instead, the only 
exclude mode supported is the case with no sources listed, which corresponds to a 
conventional group join in all versions of IGMP or MLD (e.g., as with Figure 9-13). 
For a multicast router, this means that the only state required is to keep track of 
which groups are of interest, and possibly which sources are of interest. It does not 
need to keep track of any individual sources that are not desired. 

Table 9-2 shows the modifications in message types used in the lightweight
variants of IGMPv3 and MLDv2. In this table, the empty set notation ({}) indicates
a null source address list. For example, TO_EX({}) indicates a message of type 0x04
indicating a change to EXCLUDE mode with no associated sources. The notation
(*, G) indicates group G associated with any sources, and the notation (S, G)
indicates group G associated with specific source S.


Table 9-2 Comparison of operations of full versions of IGMPv3 and MLDv2 and their "lightweight"
counterparts, LW-IGMPv3 and LW-MLDv2

Full        Lightweight   When Sent
IS_EX({})   TO_EX({})     Query response for (*, G) join
IS_EX(S)    N/A           Query response for EXCLUDE (S, G) join
IS_IN(S)    ALLOW(S)      Query response for INCLUDE (S, G) join
ALLOW(S)    ALLOW(S)      INCLUDE (S, G) join
BLOCK(S)    BLOCK(S)      INCLUDE (S, G) leave
TO_IN(S)    TO_IN(S)      Change to INCLUDE (S, G) join
TO_IN({})   TO_IN({})     (*, G) leave
TO_EX(S)    N/A           Change to EXCLUDE (S, G) join
TO_EX({})   TO_EX({})     (*, G) join
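
One way to read Table 9-2 is as a lookup from a full-protocol record to the record a
lightweight host would send instead. The following Python sketch encodes exactly the
nine rows of the table; the dictionary layout and function name are illustrative
assumptions, not taken from [RFC5790].

```python
from typing import Optional

# Key: (record type, whether the source list is non-empty).
# Value: the lightweight record type, or None where the table says N/A.
FULL_TO_LIGHTWEIGHT = {
    ("IS_EX", False): "TO_EX",   # query response for (*, G) join
    ("IS_EX", True): None,       # EXCLUDE (S, G) is not supported
    ("IS_IN", True): "ALLOW",    # query response for INCLUDE (S, G) join
    ("ALLOW", True): "ALLOW",    # INCLUDE (S, G) join
    ("BLOCK", True): "BLOCK",    # INCLUDE (S, G) leave
    ("TO_IN", True): "TO_IN",    # change to INCLUDE (S, G) join
    ("TO_IN", False): "TO_IN",   # (*, G) leave
    ("TO_EX", True): None,       # EXCLUDE (S, G) is not supported
    ("TO_EX", False): "TO_EX",   # (*, G) join
}


def lightweight_record(record_type: str, has_sources: bool) -> Optional[str]:
    """Return the lightweight record type for a full record, or None if N/A."""
    return FULL_TO_LIGHTWEIGHT[(record_type, has_sources)]
```

Combinations that do not appear in the table raise KeyError rather than guessing at a
mapping.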


Compare the values in Table 9-2 with those in Table 9-1. Notably, the non-null
EXCLUDE modes are not used and the state indicator types have been removed. In
addition, the current-state records (IS_EX and IS_IN) have been removed for compliant
hosts. Lightweight multicast routers are still supposed to be able to receive
such messages but may treat them as though they always contain a null source list.

9.4.5 IGMP and MLD Robustness 

There are two main concerns with the robustness and reliability of the IGMP and
MLD protocols. Failures of IGMP or MLD, or multicast more generally, can lead
to either the distribution of unwanted multicast traffic or the inability to deliver
desired multicast traffic. The types of failures handled by IGMP and MLD include
the failure of a multicast router and the loss of protocol messages.

The potential failure of a multicast router is handled by allowing more than
one multicast router to operate on the same link. As mentioned previously, in this
configuration the router with the lowest IP address is elected the "querier." The
querier is responsible for sending general and specific queries to determine the
current state of hosts on the subnet. Other (non-querier) routers monitor the protocol
messages, because they are also group members or multicast-promiscuous
listeners, and a different router is able to step in as the querier should the current
querier fail. To make this work properly, all the multicast routers attached to the
same link need to coordinate their queries, responses, and some of their configuration
information (primarily timers).

The first type of coordination that multiple multicast routers accomplish is
querier election. Each multicast router can hear the others' queries. When a multicast
router starts, it believes itself to be the querier and sends a general query to
determine what groups are active on a subnet. When a router receives a multicast
query from another router, it compares the source IP address with its own. If
the source IP address in the received query is smaller than its own, the receiving
router enters a standby mode. As a result, the router with the lowest IP address is
deemed the winner and becomes the single querier responsible for sending queries
to its attached subnet. Routers that are standing by set timers, and if they do
not see more queries within a specified period of time (called the other-querier-present
timer), they become queriers again.
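
The election rule can be sketched in Python as follows. This is a simplified model
under stated assumptions: the function names are hypothetical, timers are omitted, and
all addresses are taken to be of the same IP version. Addresses are compared
numerically rather than as strings, since a string comparison would rank "10.0.0.14"
below "10.0.0.2".

```python
import ipaddress


def elect_querier(router_addresses):
    """Return the address that wins the election: the numerically lowest."""
    return min(ipaddress.ip_address(a) for a in router_addresses)


def role_after_query(own: str, query_source: str, is_querier: bool) -> bool:
    """Return True if this router remains (or stays) the querier.

    A router that hears a query from a lower source address than its own
    enters standby; otherwise its role is unchanged.
    """
    if ipaddress.ip_address(query_source) < ipaddress.ip_address(own):
        return False
    return is_querier
```

A standby router would additionally restart its other-querier-present timer on each
such query and resume the querier role if that timer expires; that part is not modeled
here.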

The querying multicast router sends periodic general queries to determine
which groups and sources are of interest to the hosts on the same subnet. The rate
at which these queries are sent is determined by the querier's query interval, a
configurable timer parameter. When more than one multicast router operates on the
same subnet, the interval of the current querier is adopted by all other routers.
In this way, if the current querier fails, a switch to an alternative multicast router
does not perturb the periodic query rate.

A multicast router that has reason to believe a group (or source) is no longer
of interest sends specific queries prior to discontinuing the forwarding of the
corresponding multicast traffic (or informing the multicast routing protocol). These
queries are sent with a different interval (called the Last Member Query Time or
LMQT) from that of general queries. The LMQT is typically lower (shorter) than
the query interval and governs the leave latency. A complication can arise when
multiple multicast routers operate on the same subnet, hosts wish to leave groups
(or drop sources), and protocol messages are lost.

To help guard against lost protocol messages, some messages are retransmitted
up to a small number of times (determined by the querier robustness variable
or QRV). The QRV value is encoded in the QRV field included in queries, and
non-querying routers adopt the querier's QRV as their own. Once again, this helps to
keep consistency if a change of querier occurs. The types of messages protected
with retransmission include state-change reports and specific queries. Other messages
(current-state reports) do not typically result in a change of forwarding
state but instead only involve refreshing soft state by adjusting timers, so they are
not protected using retransmission. When retransmissions do occur, the retransmission
interval of reports is chosen at random uniformly between 0 and a configurable
parameter called the Unsolicited Report Interval, and the retransmission
interval for queries is periodic (with the interval based on the LMQT). Links that
are expected to be more prone to loss (e.g., wireless links) may require increasing
the robustness variable to increase robustness to packet loss at the expense of
generating additional traffic.
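
A host's schedule for a state-change report might be sketched as below. This is an
illustrative model only: the function name and parameters are assumptions, the initial
report is taken to be sent at time 0, and the number of retransmissions is taken to be
the robustness variable minus 1.

```python
import random


def report_send_times(rv: int = 2, unsolicited_report_interval: float = 1.0,
                      rng: random.Random = None) -> list:
    """Return send times (seconds) for a state-change report and its repeats."""
    if rv < 1:
        raise ValueError("the robustness variable must be at least 1")
    rng = rng or random.Random()
    times = [0.0]                              # initial transmission
    for _ in range(rv - 1):                    # RV - 1 retransmissions
        gap = rng.uniform(0.0, unsolicited_report_interval)
        times.append(times[-1] + gap)
    return times
```

Raising rv on a lossy link lengthens the list, which is the trade-off described above:
more copies of each report in exchange for a lower chance that all of them are lost.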

To help keep multicast routers synchronized when handling specific queries, 
the S bit field in the query message indicates that router-side (timer) processing 
should be suppressed. When a specific query is sent by the querier, a number 
(QRV) of retransmissions are scheduled. In the first query sent, the S bit field is 
clear. Upon transmission or receipt of such queries, a multicast router lowers its 
timer for subsequent retransmissions to the LMQT. At this point, it is possible for 
an interested host to provide a report indicating its continued interest in a group 
or source. If no messages are lost, the report causes each multicast router to reset 
its timer to its ordinary value and continue without change. However, the scheduled
retransmissions are not abandoned. Instead, retransmissions of the specific
query are sent with the S bit field set, which causes receiving routers to not lower 
their timers to the LMQT. 

The reason for keeping query retransmissions even after the receipt of a report 
expressing interest is so that the timeouts for groups across all multicast routers 
can be made consistent. The purpose of the S bit field, then, is to allow specific queries
to be (re)sent, but to avoid lowering the timer to LMQT because a legitimate
report expressing interest may have been received, even if it or the initial query 
was missed by the non-querier router(s). Without this capability, retransmitted 
specific queries would cause non-querier routers to lower their timers incorrectly 
(because a legitimate report indicating interest had already been received). 
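
The effect of the S bit on a router's group timer can be illustrated with a small
Python sketch. The function names are hypothetical, and the bit positions assume the
S flag and QRV share one octet as shown in the decode of Figure 9-16 (S is bit value
0x08, QRV the low-order 3 bits).

```python
def split_s_qrv(octet: int):
    """Return (s_flag, qrv) from the octet holding the S bit and QRV field."""
    return bool(octet & 0x08), octet & 0x07


def timer_after_specific_query(current: float, lmqt: float, s_flag: bool) -> float:
    """Return a router's group timer after it sends or hears a specific query.

    With S clear, the timer is lowered to the LMQT (never raised).
    With S set, router-side timer processing is suppressed.
    """
    if s_flag:
        return current
    return min(current, lmqt)
```

For example, a timer of 260s becomes 2s after a specific query with S clear and an LMQT
of 2s, but remains 260s when a retransmitted copy arrives with S set.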

9.4.6 IGMP and MLD Counters and Variables 

IGMP and MLD are soft-state protocols that also deal with failures of routers, loss 
of protocol messages, and interoperability with earlier protocol versions. Much 
of the machinery to enable these capabilities is based on timers that trigger state 
changes and protocol actions. Table 9-3 provides a summary of all of the configuration
parameters and state variables used by IGMP and MLD.

In Table 9-3, it is clear that MLD and IGMP share most of their timers and
configuration parameters, although in some cases the terminology is different.
Some values, those indicated as "cannot be changed," are set as a function of other
values and are not independently changeable.

Table 9-3 Parameters and timer values for IGMP and MLD. Most values can be altered as configuration
parameters in an implementation.

Name and Meaning
    Default Value (Restrictions)

Robustness Variable (RV): arranges for up to RV - 1 retransmissions for some
state-change reports/queries.
    Default: 2 (must not be 0; should not be 1)

Query Interval (QI): time between general queries sent by the current querier.
    Default: 125s

Query Response Interval (QRI): the maximum response time to wait for
generation of reports. This value is encoded to form the Max Response field.
    Default: 10s

Group Membership Interval (GMI) in IGMP and Multicast Address Listening
Interval (MALI) in MLD: the amount of time that must pass without seeing a
report for a multicast router to declare that there is no remaining interest in a
group or source/group combination.
    Default: RV * QI + QRI (cannot be changed)

Other Querier Present Interval in IGMP and Other Querier Present Timeout
in MLD: the amount of time that must pass without seeing a general request
for a non-querier multicast router to declare that there is no longer an active
querier.
    Default: RV * QI + (0.5) * QRI (cannot be changed)

Startup Query Interval: the interval between general queries used by a
querier just starting up.
    Default: (0.25) * QI

Startup Query Count: the number of general queries sent by a querier just
starting up.
    Default: RV

Last Member Query Interval (LMQI) in IGMP and Last Listener Query
Interval (LLQI) in MLD: the maximum response time to wait for generation
of reports responding to specific queries. This value is encoded to form the
Max Response field in specific queries.
    Default: 1s

Last Member Query Count in IGMP and Last Listener Query Count in MLD:
the number of specific queries to send without receiving a response to declare
that there is no longer an interested host.
    Default: RV

Unsolicited Report Interval: the time between retransmissions of a host's
initial state-change report.
    Default: 1s

Older Version Querier Present Timeout: the amount of time a host waits
without receiving an IGMPv1 or IGMPv2 request message to revert back to
IGMPv3.
    Default: RV * QI + QRI (cannot be changed)

Older Host Present Interval in IGMP and Older Version Host Present Timeout
in MLD: the amount of time a querier waits without receiving an IGMPv1 or
IGMPv2 report message to revert back to IGMPv3.
    Default: RV * QI + QRI (cannot be changed)
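
The derived values in Table 9-3 can be computed from the three independent parameters.
The following Python sketch uses a hypothetical class name and attribute names; only
the formulas and defaults come from the table.

```python
from dataclasses import dataclass


@dataclass
class GroupTimers:
    rv: int = 2          # Robustness Variable
    qi: float = 125.0    # Query Interval, seconds
    qri: float = 10.0    # Query Response Interval, seconds

    def __post_init__(self):
        if self.rv < 1:
            raise ValueError("the Robustness Variable must not be 0")

    @property
    def group_membership_interval(self) -> float:
        return self.rv * self.qi + self.qri

    @property
    def other_querier_present_interval(self) -> float:
        return self.rv * self.qi + 0.5 * self.qri

    @property
    def startup_query_interval(self) -> float:
        return 0.25 * self.qi

    @property
    def startup_query_count(self) -> int:
        return self.rv

    @property
    def last_member_query_count(self) -> int:
        return self.rv
```

With the defaults, the Group Membership Interval works out to 2 * 125 + 10 = 260s and
the Other Querier Present Interval to 2 * 125 + 5 = 255s. Because these are properties
rather than stored fields, they cannot be set independently, mirroring the "cannot be
changed" restriction.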


9.4.7 IGMP and MLD Snooping 

IGMP and MLD manage the flow of IP multicast traffic among routers. To optimize
traffic flow even further, it is possible for layer 2 switches (that would not
ordinarily process layer 3 IGMP or MLD messages) to become aware of whether
certain multicast traffic flows are of interest or not by looking at layer 3 information.
This capability is indicated by a switch feature known as IGMP (MLD) snooping
[RFC4541] and is supported by many switch vendors. Without IGMP snooping,
switches typically send link-layer multicast traffic by broadcasting it along all the
branches of the spanning tree formed among switches. This can be wasteful for
the reasons we described earlier. IGMP (MLD)-aware (sometimes called IGS for
IGMP snooping) switches monitor IGMP (MLD) traffic between hosts and multicast
routers and are able to keep track of which ports require which particular
multicast flows in much the same way as a multicast router does. Doing so
can substantially reduce the amount of unwanted multicast traffic being carried
through a switched network.

There are a few details that complicate the straightforward implementation of
IGMP/MLD snooping. In IGMPv3 and MLDv2, reports are generated in response
to queries. However, in earlier versions of these protocols, a report generated by
one host and heard by others that are group members on the same link causes the
additional members to suppress their reports. This can lead to a problem if IGS
switches were to forward reports to all attached interfaces, as hosts on some LAN
(VLAN) segments with group members may not be noticed. Thus, IGS switches
supporting earlier versions of IGMP and MLD avoid broadcasting reports out of
all interfaces. Instead, they forward reports only to the nearest multicast router.
Determining the location of multicast routers is made easier if Multicast Router
Discovery (MRD) is used (see Chapter 8).
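
The bookkeeping a snooping switch performs can be sketched as a small Python class.
This is a toy model under several assumptions: the class and method names are invented
for illustration, router ports are learned simply by observing queries (rather than by
MRD), and the timers that would age out stale entries are omitted.

```python
from collections import defaultdict


class SnoopingSwitch:
    def __init__(self, ports):
        self.ports = set(ports)
        self.router_ports = set()          # ports where queries were seen
        self.members = defaultdict(set)    # group -> ports with listeners

    def on_query(self, port):
        """A query arrived: remember that a multicast router is on this port."""
        self.router_ports.add(port)

    def on_report(self, port, group):
        """Record a listener and return the ports the report is forwarded to.

        Reports go only toward routers, so other hosts never hear them and
        therefore never suppress their own reports.
        """
        self.members[group].add(port)
        return self.router_ports - {port}

    def on_leave(self, port, group):
        """Forget a listener on the given port."""
        self.members[group].discard(port)

    def egress_ports(self, group, ingress):
        """Ports that should receive data traffic for a group."""
        listeners = self.members.get(group, set())
        return (listeners | self.router_ports) - {ingress}
```

In this model, traffic for a group with no known listeners is sent only toward router
ports instead of being flooded out of every port.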

Another issue of concern when implementing snooping relates to the difference
in message formats between IGMP and MLD. Because MLD is encapsulated
as part of ICMPv6 instead of its own separate protocol, MLD-snooping switches
must process ICMPv6 information and be careful to separate the MLD messages
from the others. In particular, other ICMPv6 traffic must be allowed to flow freely
for the various other functions for which ICMPv6 is used (see Chapter 8).

Other nonstandard proprietary protocols have been implemented to further
optimize IP multicast traffic carried through layer 2 devices. For example, Cisco
has proposed the Router-port Group Management Protocol (RGMP) [RFC3488]. In
RGMP, a mechanism is employed so that not only do hosts report their groups
and sources of interest (as in IGMP/MLD), but multicast routers also do the same.
This information is used to optimize layer 2 forwarding of multicast traffic among
multicast routers (not just hosts).


9.5 Attacks Involving IGMP and MLD 

Because IGMP and MLD are signaling protocols that control the flow of multicast
traffic, attacks using these protocols primarily are either DoS attacks or resource
utilization attacks. There have also been attacks that exploit buggy implementation
of the protocols, to either disable hosts or cause them to execute code provided
by an attacker.

A simple DoS attack can be mounted by sending IGMP or MLD messages to subscribe
to a large number of high-bandwidth multicast groups. Doing so can cause bandwidth
exhaustion, leading to a denial of service. A more complex attack can be
mounted by generating requests using a relatively low IP address. In this case, the
attacker is elected to be the querier for the link and can advertise its own robustness
variable, query interval, and maximum response time that will be adopted
by the other multicast routers. If the maximum response time is very small, hosts
are induced to send reports rapidly, using CPU resources.

Several attacks have been carried out by exploiting implementation bugs.
Fragmented IGMP packets have been used to induce crashes in certain operating
systems. More recently, specially crafted IGMP or MLD packets using SSM
information have been used to induce remote code execution bugs. Overall, the
impact of IGMP or MLD vulnerabilities tends to be somewhat less than with other
protocols, as multicast tends to be supported only in the local area. As a result,
remote attackers lacking on-link access to the target LAN are likely to be limited.


9.6 Summary 

Broadcasting, generically, means sending traffic to all nodes on a network. In the
context of TCP/IP, broadcasting means sending a packet to all hosts in a network
or subnetwork, typically the locally attached network. Multicasting refers to sending
traffic to only a subset of nodes in a network. In TCP/IP, multicasting means
sending a packet to the subset of hosts in the network that are interested in it. The
method for selecting the subset is dependent on the scope of the multicast traffic and
the interest of receivers. In many applications multicasting is better than broadcasting,
since multicasting imposes less overhead on hosts that are not participating
in the communication. Broadcasting is supported in IPv4 but not in IPv6.
Broadcasting and multicasting can be used to avoid having to send the same content to
multiple destinations by repeatedly using unicast connections. They can also be used
to discover servers that are otherwise unknown. Multicasting is a more complex
capability than broadcasting, as state must be maintained to determine which
hosts are interested in which groups.

In IPv4 there are two types of broadcast addresses: limited (255.255.255.255)
and directed. The directed broadcast address is based on the network prefix and
its length and is formed by creating a 32-bit address whose initial bits are equal
to the network prefix and whose low-order bits are set to 1. It is usually preferable
to use directed broadcasts instead of the limited broadcast address. Selection of
which interfaces are used to send outgoing broadcast traffic is operating-system-dependent.
A typical case is to use one primary interface for limited broadcast
traffic and use the information present in the host's forwarding table to select the
interface for outgoing directed broadcasts and multicasts.
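
The construction of a directed broadcast address can be sketched in a few lines of Python. This is only an illustration: the function name and the example prefixes are assumptions chosen for the sketch, and the standard-library ipaddress module already exposes the same value as the broadcast_address attribute of a network object.

```python
import ipaddress

def directed_broadcast(prefix: str) -> str:
    """Return the directed broadcast address for an IPv4 prefix such as '10.0.0.0/24'.

    Hypothetical helper for illustration only.
    """
    net = ipaddress.IPv4Network(prefix, strict=False)
    # Keep the prefix bits and set every low-order (host) bit to 1.
    host_bits = (1 << (32 - net.prefixlen)) - 1
    return str(ipaddress.IPv4Address(int(net.network_address) | host_bits))

print(directed_broadcast("10.0.0.0/24"))     # 10.0.0.255
print(directed_broadcast("192.168.4.0/22"))  # 192.168.7.255
```

The limited broadcast address, 255.255.255.255, needs no computation because it does not depend on the prefix.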

Multicasting in IP supports a model whereby processes interested in receiving
multicast packets subscribe to a particular group (using an IP address) on a set of
interfaces. Transmitting multicast IPv4 traffic on multicast-capable IEEE link-layer
networks (such as Ethernet) involves combining the low-order 23 bits of the group
address with the prefix 01:00:5e to form a MAC-layer destination address used
for link-layer multicasting. Transmitting IPv6 multicast traffic involves combining
the low-order 32 bits of the group address with the 16-bit prefix 33:33 to form
a MAC-layer destination address. These mappings are nonunique, meaning that
more than one IPv4 or IPv6 group address uses the same MAC-layer address. As a
consequence, host software performs filtering of incoming traffic to remove traffic
for unwanted groups.
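
As a rough illustration of these mappings, the following Python sketch derives the MAC-layer destination address from a group address. The function names and sample groups are assumptions made for the example; only the bit manipulation follows the description above.

```python
import ipaddress

def _format_mac(value: int) -> str:
    # Render a 48-bit integer as six colon-separated hexadecimal bytes.
    return ":".join(f"{(value >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))

def ipv4_group_to_mac(group: str) -> str:
    """Map an IPv4 group address to its MAC address (hypothetical helper)."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    # Prefix 01:00:5e, then the low-order 23 bits of the group address.
    return _format_mac((0x01005E << 24) | (int(addr) & 0x7FFFFF))

def ipv6_group_to_mac(group: str) -> str:
    """Map an IPv6 group address to its MAC address (hypothetical helper)."""
    addr = ipaddress.IPv6Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv6 multicast address")
    # Prefix 33:33, then the low-order 32 bits of the group address.
    return _format_mac((0x3333 << 32) | (int(addr) & 0xFFFFFFFF))

print(ipv4_group_to_mac("224.0.0.1"))          # 01:00:5e:00:00:01
print(ipv4_group_to_mac("224.128.0.1"))        # 01:00:5e:00:00:01 (same MAC, different group)
print(ipv6_group_to_mac("ff02::1:ff00:1234"))  # 33:33:ff:00:12:34
```

The first two calls show the nonuniqueness in practice: two distinct IPv4 groups share one MAC address, so the host must still filter on the IP destination.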

The IGMP and MLD protocols are used with IPv4 and IPv6, respectively, in
supporting multicast packet delivery. Multicast routers send query messages to
nearby hosts in order to determine which hosts are interested in which groups,
and (for IGMPv3 and MLDv2) which senders are of interest to these groups. Hosts
respond by sending reports indicating the groups of interest. MLD is part of the
ICMPv6 protocol, whereas IGMP is an independent protocol layered above IPv4
(like ICMP). Some switches are equipped to "snoop" IGMP and MLD traffic in
order to avoid sending multicast IP traffic along spanning tree branches where
there are no interested receiving hosts. IGMP and MLD have a "robustness variable"
that can be set to enable retransmissions of important messages on networks
prone to loss.

Because IGMP and MLD are both signaling protocols that control the flow of
other traffic, attacks against them tend to cause extra resource consumption, possibly
leading to denial of service. Other forms of attacks that exploit implementation
bugs have also been seen and have been used to cause execution of unwanted
code provided by an attacker. As MLD (and MLDv2) are relatively new in terms of
deployment, it is likely that additional exploits will ultimately be found, but these
protocols are limited in operation to a single link.

User Datagram Protocol (UDP)
and IP Fragmentation


10.1 Introduction 

UDP is a simple, datagram-oriented, transport-layer protocol that preserves message
boundaries. It does not provide error correction, sequencing, duplicate elimination,
flow control, or congestion control. It can provide error detection, and it
includes the first true end-to-end checksum at the transport layer that we have
encountered. This protocol provides minimal functionality itself, so applications
using it have a great deal of control over how packets are sent and processed.
Applications wishing to ensure that their data is reliably delivered or sequenced
must implement these protections themselves. Generally, each UDP output operation
requested by an application produces exactly one UDP datagram, which
causes one IP datagram to be sent. This is in contrast to a stream-oriented protocol
such as TCP (see Chapter 15), where the amount of data written by an application
may have little relationship to what actually gets sent in a single IP datagram or
what is consumed at the receiver.
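
The preservation of message boundaries can be observed with a small loopback experiment. The sketch below is an assumption-laden illustration, not part of the examples in this chapter: the function name is hypothetical, it relies on the loopback interface being available, and although loopback delivery is normally lossless and in order, UDP itself promises neither.

```python
import socket

def send_and_receive(messages):
    """Send each message as its own UDP datagram over loopback and read them back."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick an ephemeral port
    receiver.settimeout(2.0)          # do not block forever if a datagram is lost
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    received = []
    try:
        for msg in messages:
            # One output operation produces exactly one UDP datagram.
            sender.sendto(msg, receiver.getsockname())
        for _ in messages:
            # Each receive returns one whole datagram, never a merged byte stream.
            data, _peer = receiver.recvfrom(65535)
            received.append(data)
    finally:
        sender.close()
        receiver.close()
    return received

print(send_and_receive([b"alpha", b"be", b"gamma!"]))
```

With a stream-oriented transport the same three writes could arrive as a single 13-byte read; here each read yields one of the original messages.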

[RFC0768] is the official specification of UDP, and it has remained as a standard
without significant revisions for more than 30 years. As mentioned, UDP
provides no error correction: it sends the datagrams that the application writes
to the IP layer, but there is no guarantee that they ever reach their destination. In
addition, there is no protocol mechanism to prevent high-rate UDP traffic from
negatively impacting other network users. Given this lack of reliability and protection,
we might be tempted to conclude that there are no benefits to using UDP
at all. This is not true, however. Because of its connectionless character, it has less
overhead than other transport protocols. In addition, broadcast and multicast
operations (see Chapter 9) are much more straightforward using a connectionless
transport such as UDP. Finally, the ability of an application to choose its own unit
of retransmission can be an important consideration (see [CT90], for example).

Figure 10-1 shows the encapsulation of a UDP datagram as a single IPv4 datagram.
The IPv6 encapsulation is similar, but other details differ slightly and we
discuss them in Section 10.5. The IPv4 Protocol field has the value 17 to indicate
UDP. IPv6 uses the same value in the Next Header field. Later in this chapter we
will examine what happens when the size of the UDP datagram exceeds the MTU
size and the datagram must be fragmented into more than one IP-layer packet.


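    +------------------------+------------+-----------------+
    | IPv4 Header            | UDP Header | UDP Data        |
    | (20 bytes,             | (8 bytes)  |                 |
    |  no IP options)        |            |                 |
    +------------------------+------------+-----------------+
                             |<------- UDP Datagram ------->|
    |<------------------- IPv4 Datagram ------------------->|

Figure 10-1 Encapsulation of a UDP datagram in a single IPv4 datagram (the typical case with no IPv4
options). The IPv6 encapsulation is similar; the UDP header follows the header chain.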


10.2 UDP Header 

Figure 10-2 shows a UDP datagram, including the payload and UDP header (which
is always 8 bytes in size).

Port numbers act as mailboxes and help a protocol implementation identify the
sending and receiving processes (see Chapter 1). They are purely abstract; they do
not correspond to any physical entity on a host. In UDP, port numbers are positive
16-bit numbers, and the source port number is optional; it may be set to 0 if the
sender of the datagram never requires a reply. Transport protocols such as TCP,
UDP, and SCTP [RFC4960] use the destination port number to help demultiplex
incoming data from IP. Because IP demultiplexes the incoming IP datagram to a
particular transport protocol based on the value of the Protocol field in the IPv4
header or Next Header field in the IPv6 header, the port numbers
can be made independent among the transport protocols. That is, TCP port numbers
are used only by TCP, the UDP port numbers only by UDP, and so on.
A straightforward consequence of this separation is that two completely distinct
servers can use the same port number and IP address, as long as they use different
transport protocols.


Note

Despite this independence, if a well-known service is provided (or can conceivably
be provided) by both TCP and UDP, the port number is normally allocated
to be the same for both transport protocols. This is purely for convenience and is
not required by the protocols. See [IPORT] for details on how port numbers are
formally assigned.

Figure 10-2 The UDP header and payload (data) area. The Checksum field is end-to-end and is
computed over the UDP pseudo-header, which includes the Source and Destination IP
Address fields from the IP header. Thus, any modification made to those fields (e.g., by
NAT) requires a modification to the UDP checksum.


Referring to Figure 10-2, the UDP Length field is the length of the UDP header 
and the UDP data in bytes. The minimum value for this field is 8 except when 
UDP is used with IPv6 jumbograms (see Section 10.5). Sending a UDP datagram 
with 0 bytes of data is acceptable, although rare. Note that the UDP Length field 
is redundant; the IPv4 header contains the datagram's total length (see Chapter 
5), and the IPv6 header contains the payload length. The length of a UDP/IPv4 
datagram is then the total length of the IPv4 datagram minus the length of the 
IPv4 header. A UDP/IPv6 datagram's length is the value of the Payload Length field 
contained in the IPv6 header minus the lengths of any extension headers (unless 
jumbograms are being used). In either case, the UDP Length field should match the 
length computed from the IP-layer information. 
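
To make the header layout and the Length redundancy concrete, here is a minimal Python sketch that unpacks the four 16-bit header fields and compares the UDP Length field against the value implied by the IPv4 header. The function names and the dictionary layout are assumptions for illustration; the sample numbers (1024 bytes of payload, a 1052-byte IPv4 datagram, source port 46274, destination port 9) mirror the example used later in this chapter.

```python
import struct

# Source port, destination port, length, checksum: four 16-bit fields in network byte order.
UDP_HEADER = struct.Struct("!HHHH")

def parse_udp(datagram: bytes) -> dict:
    """Split a UDP datagram into header fields and payload (hypothetical helper)."""
    if len(datagram) < UDP_HEADER.size:
        raise ValueError("a UDP header is always 8 bytes")
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "length": length,          # UDP header plus UDP data, in bytes
        "checksum": checksum,
        "payload": datagram[UDP_HEADER.size:],
    }

def length_matches_ipv4(udp_length: int, ipv4_total_length: int,
                        ipv4_header_length: int = 20) -> bool:
    """Check the UDP Length field against the length implied by the IPv4 header."""
    return udp_length == ipv4_total_length - ipv4_header_length

datagram = UDP_HEADER.pack(46274, 9, 8 + 1024, 0) + bytes(1024)
fields = parse_udp(datagram)
print(fields["src_port"], fields["dst_port"], fields["length"])  # 46274 9 1032
print(length_matches_ipv4(fields["length"], 1052))                # True
```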


10.3 UDP Checksum 

The UDP checksum is the first end-to-end transport-layer checksum we have
encountered (ICMP has an end-to-end checksum but is not a true transport protocol).
It covers the UDP header, the UDP data, and a pseudo-header (defined later in
this section). It is computed at the initial sender and checked at the final destination.
It is not modified in transit (except when it passes through a NAT, as described in
Chapter 7). Recall that the checksum in the IPv4 header covers only the header
(i.e., it does not cover any data in the IP packet) and is recomputed at each IP hop
(required because the IPv4 TTL field is decremented by routers when the datagram
is forwarded). Transport protocols (e.g., TCP, UDP) use checksums to cover
their headers and data. With UDP, the checksum is optional (although strongly
suggested), while with the others it is mandatory. When UDP is used with IPv6,
computation and use of the checksum are mandatory because there is no header
checksum at the IP layer. To provide error-free data to applications, a transport-layer
protocol such as UDP must always compute a checksum or use some other
error detection mechanism before delivering the data to a receiving application.

Although the basics for calculating the UDP checksum are similar to what we
described in Chapter 5 for the general Internet checksum (the one's complement
of the one's complement sum of 16-bit words), there are two small special details.
First, the length of the UDP datagram can be an odd number of bytes, whereas
the checksum algorithm adds 16-bit words (always an even number of bytes). The
procedure for UDP is to append a (virtual) pad byte of 0 to the end of odd-length
datagrams, just for the checksum computation and verification. This pad byte is
not actually transmitted and is thus called "virtual" here.

The second detail is that UDP (as well as TCP) computes its checksum over a
12-byte pseudo-header derived (solely) from fields in the IPv4 header or a 40-byte
pseudo-header derived from fields in the IPv6 header. This pseudo-header is also
virtual and is used only for purposes of the checksum computation (at both the
sender and the receiver). It is never actually transmitted. This pseudo-header
includes the source and destination addresses and Protocol or Next Header field
(which should contain the value 17) from the IP header. Its purpose is to let the
UDP layer verify that the data has arrived at the correct destination (i.e., that IP
has not accepted a misaddressed datagram, and that IP has not given UDP a datagram
that is for another transport protocol). Figure 10-3 shows what is covered
when computing the UDP checksum, including the pseudo-header along with the
UDP header and payload.


Note 

The careful reader will note that this causes a so-called layering violation. That 
is, the UDP protocol (transport layer) is directly processing bits “owned” by IP 
(network layer). While true, it is of only minor consequence to protocol implemen¬ 
tations, which in general have IP-layer information readily available when data 
is passed to (or from) UDP. It is of far greater concern for NATs (see Chapter 7), 
especially if UDP datagrams are fragmented. 


Figure 10-3 Fields used in computing the checksum for UDP/IPv4 datagrams, including the
pseudo-header, the UDP header, and data. If the data is not an even number of bytes, it
is padded with one 0 byte for purposes of computing the checksum. The pseudo-header
and any pad bytes are not transmitted with the datagram.


Figure 10-3 shows a datagram with an odd data length, requiring a pad byte
for the checksum computation. Note that the length of the UDP datagram appears
twice in the checksum computation. If the value of the calculated checksum
happens to be 0x0000, it is stored in the header as all 1 bits (0xFFFF), which is
equivalent in one's complement arithmetic (see Chapter 5). Upon receipt, a Checksum
field value of 0x0000 indicates that the sender did not compute a checksum.
If the sender did compute a checksum and the receiver detects a checksum error,
the UDP datagram is silently discarded. No error message is generated, although
some statistical counts may be updated. (This is what happens if an IPv4 header
checksum error is detected.)
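
The checksum procedure described in this section can be sketched in Python for the UDP/IPv4 case. This is an illustrative sketch rather than a reference implementation: the function names are assumptions, the addresses and ports are sample values, and a real stack would compute the checksum in the kernel or in network hardware.

```python
import socket
import struct

def ones_complement_sum(data: bytes) -> int:
    """One's complement sum of 16-bit words, with a virtual zero pad byte if needed."""
    if len(data) % 2:
        data += b"\x00"                      # pad byte: used here, never transmitted
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total

def ipv4_pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    # 12 bytes: source address, destination address, zero, protocol (17), UDP length.
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 17, udp_length))

def build_udp(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              payload: bytes) -> bytes:
    """Return a UDP header plus payload with the Checksum field filled in."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    covered = ipv4_pseudo_header(src_ip, dst_ip, length) + header + payload
    checksum = ~ones_complement_sum(covered) & 0xFFFF
    checksum = checksum or 0xFFFF            # a computed 0x0000 is sent as 0xFFFF
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload

def udp_checksum_ok(src_ip: str, dst_ip: str, datagram: bytes) -> bool:
    """Receiver-side check; a Checksum field of 0 means the sender computed none."""
    _, _, length, checksum = struct.unpack_from("!HHHH", datagram)
    if checksum == 0:
        return True
    covered = ipv4_pseudo_header(src_ip, dst_ip, length) + datagram
    return ones_complement_sum(covered) == 0xFFFF

dgram = build_udp("10.0.0.5", "10.0.0.3", 46274, 9, b"hello")  # odd-length payload
print(udp_checksum_ok("10.0.0.5", "10.0.0.3", dgram))          # True
print(udp_checksum_ok("10.0.0.5", "10.0.0.4", dgram))          # False
```

The second call fails because the destination address in the pseudo-header no longer matches the one the sender used, which is the misdelivery check the pseudo-header exists to provide.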

Despite UDP checksums being optional in the original UDP specification,
they are currently required to be enabled on hosts by default [RFC1122]. During
the 1980s some computer vendors turned off UDP checksums by default to speed
up their implementation of Sun's Network File System (NFS), which uses UDP.
While this might not cause problems in many cases because of the presence of
layer 2 CRC protection (which is stronger than the Internet checksum; see Chapter
3), it is considered bad form (and a violation of the RFCs) to disable checksums
by default. Early experience in the Internet revealed that when datagrams pass
through routers, all bets are off with respect to their correctness. Believe it or not,
there have been routers with software and hardware bugs that have modified bits
in the datagrams being forwarded. These errors are undetectable in a UDP datagram
if the end-to-end UDP checksum is disabled. Also realize that some older
data-link protocols (e.g., serial line IP, or SLIP) do not have any form of data-link
checksum, thereby leaving open the possibility that IP packets could be undetectably
modified unless another checksum is employed.


Note 

[RFC1122] requires that UDP checksums be enabled by default. It also states that 
an implementation must verify a received checksum if the sender calculated one 
(i.e., if the received checksum is not 0). 


Given the structure of the pseudo-header, it is clear that when a UDP/IPv4 data¬ 
gram passes through a NAT, not only is the IP-layer header checksum modified, 
but the UDP pseudo-header checksum must be appropriately modified because 
the IP-layer addressing and/or UDP-layer port numbers may have changed. NATs 
therefore routinely perform "layering violations" by modifying multiple layers of 
protocol within packets at the same time. Of course, given that the pseudo-header 
is itself a layering violation, a NAT has little choice. The particular rules that apply 
when UDP traffic is processed by a NAT are given in [RFC4787]. We also dis¬ 
cussed them briefly in Chapter 7. 
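
A NAT does not have to recompute the UDP checksum over the whole datagram when it rewrites an address or port. The sketch below uses the incremental-update identity for the Internet checksum, HC' = ~(~HC + ~m + m'), which is an assumption brought in for illustration and is not a technique described in this chapter. The function names and the sample addresses are likewise hypothetical.

```python
import socket
import struct

def fold(total: int) -> int:
    """Fold carries so the value fits in 16 bits (one's complement addition)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def words(data: bytes) -> list:
    if len(data) % 2:
        data += b"\x00"
    return [w for (w,) in struct.iter_unpack("!H", data)]

def full_udp_checksum(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> int:
    """Reference computation over pseudo-header, UDP header, and data."""
    length = 8 + len(payload)
    covered = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
               + struct.pack("!BBH", 0, 17, length)
               + struct.pack("!HHHH", src_port, dst_port, length, 0)
               + payload)
    return (~fold(sum(words(covered))) & 0xFFFF) or 0xFFFF

def nat_rewrite_checksum(old_checksum: int, old_ip: str, old_port: int,
                         new_ip: str, new_port: int) -> int:
    """Adjust a UDP checksum after one address/port pair has been rewritten."""
    if old_checksum == 0:
        return 0                             # sender computed no checksum; leave it
    old_words = words(socket.inet_aton(old_ip) + struct.pack("!H", old_port))
    new_words = words(socket.inet_aton(new_ip) + struct.pack("!H", new_port))
    total = ~old_checksum & 0xFFFF
    for old, new in zip(old_words, new_words):
        total += (~old & 0xFFFF) + new       # remove the old word, add the new one
    return (~fold(total) & 0xFFFF) or 0xFFFF

payload = bytes(1024)
before = full_udp_checksum("10.0.0.5", "10.0.0.3", 46274, 9, payload)
after = nat_rewrite_checksum(before, "10.0.0.5", 46274, "192.0.2.1", 50000)
print(after == full_udp_checksum("192.0.2.1", "10.0.0.3", 50000, 9, payload))  # True
```

The adjustment touches only the words that changed, so its cost is independent of the payload size. It does, however, require the NAT to find the UDP header, which is one reason fragmented UDP datagrams are troublesome for NATs.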

Recently there has been interest in relaxation of the UDP checksum for appli¬ 
cations that are partially insensitive to errors (multimedia applications being the 
typical case). The discussion relates to whether having a partial checksum is a valu¬ 
able concept. A partial checksum covers only a portion of the payload specified by 
the application. We discuss this in Section 10.6 in the context of UDP-Lite. 


10.4 Examples 

We will use the sock program [SOCK] to generate some UDP datagrams that we 
can watch with tcpdump. In the first example, we are running a server on the 
discard port (9) on the destination machine. In the second example, we have dis¬ 
abled the server, and the client is informed of this fact as illustrated here. Very few 
UDP-based services are made available in typical machine configurations because 
of security concerns, so the second part of the example is not unusual. 

Linux% sock -v -u -i 10.0.0.3 discard

connected on 10.0.0.5.46274 to 10.0.0.3 
wrote 1024 bytes 

(1023 more times) 


Linux% sock -v -u -i 10.0.0.3 discard 

connected on 10.0.0.5.46294 to 10.0.0.3 
wrote 1 bytes 

write returned -1, expected 1024: Connection refused 

When we execute the sock program, we specify the verbose mode, -v, to see
the ephemeral port numbers, specify UDP with -u instead of the default TCP, and use
the -i option to send data instead of trying to read and write standard input and
output. The default number of datagrams (1024) is sent to the destination host with
IP address 10.0.0.3. In this case we have arranged a server to process incoming
datagrams to the discard port. To capture the traffic sent, we use the following
command on a host with access to the traffic stream:


Linux# tcpdump -n -p -s 1500 -vvv host 10.0.0.3 and \( udp or icmp \)

This command captures any UDP or ICMP traffic between the two machines (and
possibly additional traffic not illustrated). The -s 1500 option directs tcpdump
to collect packets up to 1500 bytes in length (longer than the 1024 bytes we are
sending, in this case), and the -vvv option indicates verbose printing. The -n
option tells tcpdump to not convert IP addresses to machine names, and the -p
option avoids placing the default network interface into promiscuous mode. The
resulting tcpdump output is illustrated in Listing 10-1 (some lines have been
wrapped for clarity).

Listing 10-1 tcpdump output showing packets from the first sock command (server running) 

1 22:52:53.102838 10.0.0.5.46274 > 10.0.0.3.9: 

[udp sum ok] udp 1024 (DF) (ttl 64, id 24462, len 1052) 

2 22:52:53.102964 10.0.0.5.46274 > 10.0.0.3.9: 

[udp sum ok] udp 1024 (DF) (ttl 64, id 24463, len 1052) 

3 22:52:53.103091 10.0.0.5.46274 > 10.0.0.3.9: 

[udp sum ok] udp 1024 (DF) (ttl 64, id 24464, len 1052) 

4 22:52:53.103215 10.0.0.5.46274 > 10.0.0.3.9: 

[udp sum ok] udp 1024 (DF) (ttl 64, id 24465, len 1052) 

. . . repeated 1020 times . . . 


This output shows four 1052-byte UDP/IPv4 datagrams (1024 bytes of UDP
payload plus 8 bytes of UDP header and the 20-byte IPv4 header) sent from IPv4
address 10.0.0.5 and port 46274 to port 9 (the discard port), with an interpacket
time of about 100μs. In addition, we may observe that UDP checksums
are enabled and are valid (checked by tcpdump), that the Don't Fragment (DF)
bit field is turned on, the IPv4 TTL field is 64, and the IPv4 Identification field is
different (and increasing by 1) for each datagram. No ICMP traffic is generated,
and it would appear that all data was successfully delivered to the destination
machine, although because there are no acknowledgments, we do not know with
certainty. We shall see in Chapter 13 that the other major transport protocol, TCP,
normally uses a handshake with the other end before the first byte of data can be
sent and uses subsequent acknowledgments to know what data has been successfully
transferred to the receiver.

The second time, we run the sock program with the same arguments, but
this time we send our datagrams to the discard service after the server has been
disabled. Listing 10-2 shows the trace for this example (some lines have been
wrapped for clarity).


Listing 10-2 tcpdump output showing ICMP Destination Unreachable (Port Unreachable) message 
from host (server disabled) 

1 22:55:07.223094 10.0.0.5.46294 > 10.0.0.3.9: 

[udp sum ok] udp 1024 (DF) (ttl 64, id 37874, len 1052) 

2 22:55:07.223134 10.0.0.3 > 10.0.0.5: icmp: 

10.0.0.3 udp port 9 unreachable for 
10.0.0.5.46294 > 10.0.0.3.9: 

udp 1024 (DF) (ttl 64, id 37874, len 1052) 

[tos 0xc0] (ttl 255, id 63302, len 576)


In this example we see somewhat different behavior. Here, only a single UDP
datagram is sent, and an ICMP message is returned in response. Although all
other parameters are the same, no server is running to receive the incoming datagrams.
In this case, the underlying UDP implementation causes an ICMPv4 Destination
Unreachable (Port Unreachable) message (see Chapter 8) to be generated
and returned to the sender. This message includes a copy of the first 556 bytes
of the original ("offending") datagram. If the ICMP message is not discarded by
the intervening network (accidentally or on purpose by firewalls), the sending
application (if it is still running when the ICMP message arrives) can learn of
the absence of the receiver and print an error, as indicated in the listing at the
beginning of this section (i.e., the write returned -1 message). Note that the
returning ICMP error message contains enough information for the sending host
to ascertain which port was not reachable. Finally, note that the source UDP port
number changes each time the program is run. First it was 46274 and then it
was 46294. We mentioned in Chapter 1 that the ephemeral port numbers used by
clients are suggested to be in the range 49152 through 65535, so here we observe
noncompliant behavior.


Note 

For Linux, the local port parameter range can be easily modified by changing the
contents of the file /proc/sys/net/ipv4/ip_local_port_range. In Windows
Vista and later, the netsh command can be used to set the dynamic port
range [KB929851]. See [IPORT] for current port numbers.



Pseudo-header (40 bytes), drawn 32 bits (bit positions 0 through 31) per row:

    Source IPv6 Address        (16 bytes; derived from IPv6 header)
    Destination IPv6 Address   (16 bytes; derived from IPv6 header)
    Length                     (4 bytes)
    Reserved (0), followed by Next Header (1 byte)

Figure 10-4 The UDP (and TCP) pseudo-header used with IPv6 ([RFC2460]). The pseudo-header
includes the source and destination IPv6 addresses and a larger 32-bit Length field
value. The pseudo-header checksum is required when UDP is used with IPv6 because
the IPv6 header lacks a checksum. The Next Header field is copied from the last IPv6
header of the chain.


10.5 UDP and IPv6 

Given its simplicity, UDP requires only small changes when operating over IPv6
instead of IPv4. The most obvious differences are the 128-bit addresses used by
IPv6 and the corresponding effect on the construction of the pseudo-header. A
related but more subtle distinction is that in IPv6, no IP-layer header checksum is
present. Thus, if UDP were to operate with checksums disabled, there would be no
end-to-end check whatsoever on the correctness of the IP-layer addressing information.
For this reason, when UDP is used with IPv6, a pseudo-header checksum,
common to both UDP and TCP, is required (by Section 8 of [RFC2460]). The
construction of the pseudo-header is given in Figure 10-4. Note that the Length field
has expanded from its IPv4 counterpart to 32 bits. Recall from earlier that this field
is redundant for UDP, but we shall see in Chapter 13 that it is not redundant when
used with TCP (either TCP/IPv4 or TCP/IPv6) and has thus been retained for use
with both UDP/IPv6 and TCP/IPv6.

Expanding the discussion regarding the IPv6 packet length, two aspects of
IPv6's packet size can affect UDP. First, in IPv6, the minimum MTU size is 1280
bytes (as opposed to the 576 bytes required by IPv4 as the minimum size required
to be supported by all hosts). Second, IPv6 supports jumbograms (packets larger
than 65,535 bytes). If we inspect the IPv6 header and option set (see Chapter 5),
we can observe that with jumbograms, 32 bits are available to hold the payload
length. This implies that a single UDP/IPv6 datagram could be very large indeed.
As described in [RFC2675], this poses a problem for the UDP Length field in the
UDP header, which is only 16 bits long. As such, when encapsulated in IPv6, a
UDP/IPv6 datagram exceeding 65,535 bytes has its UDP Length field value set to
0. Note that the size of the Length field in the pseudo-header is still large enough
(32 bits). Computing the value of this field for IPv6 jumbograms involves taking
the total length of the UDP header plus data. Checking this field when receiving
a packet involves computing the size of the UDP datagram (header plus data)
by subtracting the size of all IPv6 extension headers from the value found in the
Jumbo Payload option, which gives the length of the IPv6 payload (i.e., the total
datagram length minus the 40-byte IPv6 header). In the "unexpected" case where
the Length field in the UDP header is 0 but no Jumbo Payload option is present, the
UDP length can be inferred based on the nonzero IPv6 Payload Length field (see
Section 4 of [RFC2675]).

10.5.1 Teredo: Tunneling IPv6 through IPv4 Networks 

Although it was once thought that a worldwide transition to IPv6 might happen
quickly, this has not materialized exactly as forecast. Consequently, a number
of (theoretically temporary) transition mechanisms [RFC4213][RFC5969] have
been proposed to ease the transition burden. One such mechanism is called 6to4
[RFC3056], whereby IPv6 packets used by hosts are encapsulated in IPv4 packets
that may be delivered over an IPv4-only infrastructure. One problem with 6to4 is
that it suffers from the same types of NAT traversal problems as other applications
on the Internet. It is also known to have scaling problems that make its continued
use unattractive [RFC6343]. Although methods we have seen such as ICE (see
Chapter 7) could conceivably be used for handling this issue, a special protocol
called Teredo (originally called "shipworm" but renamed based on the Latin name
for a common genus of shipworm to avoid confusion with computer worms) has
been devised especially to address this problem [RFC4380][RFC5991][RFC6081]. It
is popular because of its widespread availability in modern versions of Microsoft
Windows.

Teredo (also called Teredo tunneling) transports IPv6 datagrams in the payload
area of UDP/IPv4 datagrams for systems that have no other IPv6 connectivity
options. An example scenario is given in Figure 10-5. Teredo clients are IPv4/IPv6
hosts that implement a Teredo tunneling interface. Such interfaces are assigned
special Teredo addresses using the 2001::/32 IPv6 prefix after having successfully
engaged in a "qualification" procedure, described in the next paragraph. Teredo
servers, which serve a general purpose similar to STUN servers (Chapter 7), are
used to help establish direct tunnels of Teredo-encapsulated IPv6 packets through
NATs. Teredo relays serve a purpose similar to TURN servers and consequently
may take significant processing resources if used by many clients. Note that servers
must include all of the capabilities of relays, but not vice versa. Using Teredo
relays is a "last-resort" option for IPv6 connectivity. Nodes cease to perform Teredo
tunneling if they discover that they have any other IPv6 connectivity option (e.g.,
direct or using 6to4).

Referring to Figure 10-5, a Teredo client is initially configured with the name
or IPv4 address and UDP port number (usually 3544) of a Teredo server. Teredo
was initially developed by Microsoft, and a Teredo server is available using the
name teredo.ipv6.microsoft.com. When the client is ready to obtain an address, it starts
the qualification procedure. The client begins by sending an ICMPv6 RS packet (see
Chapter 8) from one of its link-local IPv6 addresses using its Teredo service port,
the agent responsible for encapsulating and decapsulating IPv6 traffic within
UDP/IPv4. The encapsulation format is the Origin Indication format, one of two
shown in Figure 10-6.

Figure 10-5 Teredo, an IPv6 transition mechanism, encapsulates IPv6 datagrams and optional trailers
within the payload area of UDP/IPv4 datagrams to carry IPv6 traffic across IPv4-only
infrastructures. The server helps clients obtain an IPv6 address and determine
their mapped addresses and port numbers. Relays, if required, can forward traffic
between Teredo, 6to4, and native IPv6 clients.

Successful responses are ICMPv6 RA messages that use the Origin Indication
Encapsulation format from Figure 10-6. The RA contains a Prefix Information
option with a valid Teredo prefix (see Chapter 2). The Origin Indication provides
the client with knowledge of its own mapped address and port information. The
source address of the RA is a valid link-local IPv6 address of the server. The destination
is the client's link-local IPv6 address used as the source of the RS message.
Assuming that all goes well, the client is now "qualified" and can build its Teredo
IPv6 address based on the prefix and origin information provided by the server.
The Teredo address is an IPv6 address constructed from various parameters using
the format of Figure 10-7.

A Teredo address (see Figure 10-7) contains the Teredo prefix (2001::/32), the
IPv4 address of the Teredo server, a 16-bit Flags field detailed in the next paragraph,
followed by the mapped port number and mapped IPv4 address. The last
two values are the addressing information of the client as seen from the Teredo
server and are usually determined by the client's outermost NAT. The actual
address and port number information is bitwise-inverted so that indiscriminate
NATs do not rewrite them.

[Figure 10-6 layout: Simple Encapsulation consists of an IPv4 header (20 bytes, no
options), a UDP header (8 bytes), and the encapsulated IPv6 datagram carried as the
UDP payload. Origin Indication Encapsulation adds an 8-byte Origin Indication
between the UDP header and the IPv6 datagram, made up of a Zero (0) field, the
Origin Port Number, and the Origin (Mapped) IPv4 Address.]

Figure 10-6 The Simple Encapsulation and Origin Indication Encapsulation formats used by Teredo.
The Origin Indication Encapsulation carries UDP address and port number information
between the UDP header and encapsulated IPv6 datagram. This information is
used to inform Teredo clients about their mapped addresses and port numbers when
creating a Teredo address. Addresses and port numbers are "obfuscated" by inverting
each bit present to fend off NATs that attempt to rewrite this information. Zero or more
trailers may be present, encoded as TLV triples. They are used to implement a number
of Teredo extensions (e.g., support for symmetric NATs).

[Figure 10-7 layout: the 128-bit IPv6 address consists of the Teredo IPv6 Prefix
(32 bits, 2001::/32), the Teredo Server IPv4 Address (32 bits), Flags (16 bits,
specified by [RFC5991], including the Random1 (4 bits) and Random2 (8 bits) fields),
the Mapped Port (16 bits), and the Mapped IPv4 Address (32 bits). The last two fields
are the client's (mapped) address, obscured by complementing each bit.]

Figure 10-7 Teredo clients use IPv6 addresses from the 2001::/32 Teredo prefix. The subsequent
bits contain the Teredo server's IPv4 address, 16 flag bits that identify the type of NAT
encountered and random bits to help thwart address-guessing attacks, and 16 bits
containing the client's mapped port number and the client's mapped 32-bit IPv4 address.
The last two values are "obfuscated."
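
The address layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a Teredo implementation: the function names are hypothetical, and the sample server, port, and client values used in the usage note are arbitrary illustrative inputs rather than values from this chapter.

```python
import ipaddress

TEREDO_PREFIX = 0x20010000  # the 32-bit Teredo prefix, 2001::/32


def teredo_address(server_ipv4, flags, mapped_port, mapped_ipv4):
    """Assemble a Teredo IPv6 address from its five fields (hypothetical helper)."""
    server = int(ipaddress.IPv4Address(server_ipv4))
    # The mapped port and address are "obfuscated" by inverting every bit.
    port = mapped_port ^ 0xFFFF
    addr = int(ipaddress.IPv4Address(mapped_ipv4)) ^ 0xFFFFFFFF
    value = (
        (TEREDO_PREFIX << 96)
        | (server << 64)
        | ((flags & 0xFFFF) << 48)
        | (port << 32)
        | addr
    )
    return ipaddress.IPv6Address(value)


def mapped_endpoint(teredo):
    """Recover the (mapped IPv4 address, mapped port) pair from a Teredo address."""
    value = int(ipaddress.IPv6Address(teredo))
    port = ((value >> 32) & 0xFFFF) ^ 0xFFFF
    addr = (value & 0xFFFFFFFF) ^ 0xFFFFFFFF
    return str(ipaddress.IPv4Address(addr)), port
```

For example, `teredo_address("65.54.227.120", 0, 40000, "192.0.2.45")` places the inverted port (0x63bf) and inverted address (0x3ffffdd2) in the low 48 bits, and `mapped_endpoint` applies the same inversion to recover them.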

The 16-bit Flags field has been used to indicate the type of NAT discovered
during the qualification process. Some NATs (formerly called symmetric NATs,
the types of NATs that have either address-dependent mapping or address- and
port-dependent mapping along with either address-dependent or address- and
port-dependent filtering behavior) work with Teredo only when extensions are
supported (see later in this section), but the most common types for household
networks (including "cone NATs," NATs with endpoint-independent mapping
and endpoint-independent filtering behavior) work without such extensions.
Originally, the C (cone NAT) bit field was used to indicate if a cone NAT was
encountered and to arrange appropriate support, but this usage is now deprecated
and the field should be set to 0 (clients ignore the field; servers inspect it to look
for legacy clients). The next bit field is set to 0. The U (Universal) and G (Group) bit
fields are available for future use but are also currently set to 0. The Random1 and
Random2 field values are chosen as random numbers according to [RFC5991] to
make Teredo addresses harder to guess (a security measure intended to reduce
random probes by potential attackers).

Once a qualified client builds its Teredo address, it can send IPv6 traffic. For
details on what happens when qualification fails or a secure qualification is to be
used, see [RFC4380]. In general, a Teredo client may wish to communicate with
another client on the same link, another client within the IPv4 Internet, or with
a host on the IPv6 Internet. In each case, Teredo provides some UDP/IPv4-based
alternative to IPv6 ND. For clients on the same link, Teredo uses an IPv4 multicast
discovery protocol that operates using the multicast address 224.0.0.253. Special
Teredo "bubble" packets (those with no data payload) are used to determine if
a destination is on the same link. Such bubbles appear as minimum-size Teredo
packets using the Simple Encapsulation format of Figure 10-6. They contain an
IPv6 header with the Destination IP Address field set to the target of the communication.
The IPv6 packet contains an IPv6 header with no payload or additional
extensions (the Next Header field is set to 0x3b, indicating none). For clients within
the IPv4 Internet, recall that the Teredo IPv6 address contains the mapped IPv4
address and port number. Thus, it is straightforward for one client to send a
Teredo-encapsulated packet to another's NAT. For NATs that are restrictive, Teredo uses
bubble packets to perform hole punching and establish UDP NAT mappings (see
Chapter 7 and [RFC6081]).
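
As a rough illustration of what a bubble looks like on the wire, the sketch below builds the 40-byte IPv6 header that forms the entire UDP payload of a bubble under Simple Encapsulation. The function name and the default hop limit of 255 are assumptions made for the example; only the empty payload and the Next Header value 0x3b come from the text.

```python
import ipaddress
import struct

NEXT_HEADER_NONE = 0x3B  # "no next header": nothing follows the IPv6 header


def teredo_bubble(src, dst, hop_limit=255):
    """Return the IPv6 header of a bubble (hypothetical helper, hop limit assumed)."""
    first_word = 6 << 28      # version 6, traffic class 0, flow label 0
    payload_length = 0        # a bubble carries no data payload
    fixed = struct.pack("!IHBB", first_word, payload_length,
                        NEXT_HEADER_NONE, hop_limit)
    return (fixed
            + ipaddress.IPv6Address(src).packed
            + ipaddress.IPv6Address(dst).packed)
```

The result would be sent as the payload of a UDP/IPv4 datagram from the client's Teredo service port; that step is omitted here.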

When a qualified client has a packet to send to an IPv6 host (i.e., one that does
not use a Teredo address), it first determines whether it already knows a Teredo
relay for the destination. If so, the packet is sent using Simple Encapsulation. If not,
the client formats an ICMPv6 Echo Request containing a large (e.g., 64-bit) random
number and sends it to the IPv6 destination by way of the Teredo server. The
server forwards this packet to the destination IPv6 host. The receiving host sees an
incoming IPv6 datagram with the source address equal to the Teredo address of
the client. It forms an Echo Reply, which is routed to the nearest Teredo relay. The
relay then forwards the reply back to the client. The receiving client observes the
IPv4 address of the relay and updates a cache to indicate that subsequent packets
destined for the IPv6 host should use the relay address it just determined.

As of [RFC6081], Teredo can support a number of optional extensions, several 
of which help to support Teredo operation with symmetric NATs. The extensions 
are protocol behavior modifications and include the following: Symmetric NAT 
Support (SNS), UPnP-Enabled Symmetric NAT (UP), Port-Preserving Symmetric 
NAT (PP), Sequential Port-Symmetric NAT (SP), Hairpinning (HP), and Server 
Load Reduction (SLR). The extensions can be used independently, except that both 
the UP and PP extensions depend on the SNS extension. The various NAT types 
that can be supported with various extension combinations are given in a table 
(see Section 3 of [RFC6081]). 

To implement the extensions, one or more trailers may be present in a Teredo 
message. Trailers are encoded as an ordered list of TLV combinations, using the 
same basic format as for ICMPv6 ND options (Figure 8-41), which contain an 8-bit 
Type field and an 8-bit Length field. The two highest-order bits of the Type field 
encode what processing should be performed if the host does not recognize the 
trailer type. The bit pattern 01 indicates that the host should discard the packet; 
all others indicate that the unknown trailer should be skipped and others should 
be processed in order. The official list of trailer type values is maintained by the
IANA [TTYPES]. The trailers currently defined are listed here in Table 10-1.


Table 10-1 Teredo trailers are carried after the IPv6 payload encapsulated in a UDP/IPv4 datagram. Each
trailer has a type value, name, and associated explanation. In some cases, the length value is a
constant.

Type  Length    Name               Use                  Notes
0x00  Reserved  (Unassigned)       (Unassigned)         (Unassigned)
0x01  0x04      Nonce              SNS, UP, PP, SP, HP  32-bit nonce for protection against replays (see Chapter 18)
0x02  Reserved  (Unassigned)       (Unassigned)         (Unassigned)
0x03  [8, 26]   Alternate Address  HP                   Additional addresses/ports usable by Teredo clients behind the same NAT
0x04  0x04      ND Option          SLR                  Allows NAT refresh using direct bubbles (that carry NS messages)
0x05  0x02      Random Port        PP                   Sender's predicted mapped port

The Nonce trailer contains a 32-bit random value that is unique for each
message. It is a security measure to guard against replay attacks (see Chapter 18)
and is used with either HP or SNS. The Alternate Address trailer carries
(IPv4 address, port) pairs. Each pair is 6 bytes
long, and the trailer can hold from one to four such pairs. These pairs identify
UDP/IPv4 endpoints that other Teredo clients on the same side of a NAT can use
to contact the sender, and they are used with the HP extension.

The ND Option trailer includes 1 byte that indicates either TeredoDiscoverySolicitation
(0x00) or TeredoDiscoveryAdvertisement (0x01). In the first case, the
receiver is requested to respond with a direct bubble (i.e., sent directly between
Teredo clients) containing the second form of message. The TeredoDiscoveryAdvertisement
type is the response. This trailer is used in supporting the SLR extension,
which effectively allows NS/NA messages carried in direct bubbles to be
used for refreshing NAT state instead of indirect bubbles, which require processing
by servers. Finally, the Random Port trailer contains a 16-bit UDP port number,
which is the sender's best guess as to its mapped port number. This is used by the
PP extension (see Section 6.3 of [RFC6081]).
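
The trailer processing rules can be sketched as a small TLV walker. This is an illustrative sketch under stated assumptions: the function name is hypothetical, the Length byte is assumed to count only the value bytes (consistent with the constant lengths in Table 10-1), and the Nonce type code 0x01 is taken from [RFC6081]. Unknown types are handled as the text describes: bit pattern 01 in the two highest-order Type bits discards the packet, and any other pattern skips the trailer.

```python
KNOWN_TRAILERS = {
    0x01: "Nonce",
    0x03: "Alternate Address",
    0x04: "ND Option",
    0x05: "Random Port",
}


def parse_trailers(data):
    """Return [(name, value_bytes), ...], or None if the packet must be discarded."""
    trailers = []
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated trailer header")
        trailer_type, length = data[pos], data[pos + 1]
        value = data[pos + 2:pos + 2 + length]
        if len(value) != length:
            raise ValueError("truncated trailer value")
        pos += 2 + length
        if trailer_type in KNOWN_TRAILERS:
            trailers.append((KNOWN_TRAILERS[trailer_type], value))
        elif trailer_type >> 6 == 0b01:
            return None  # unrecognized trailer that demands a discard
        # any other unrecognized trailer is skipped; processing continues in order
    return trailers
```

A Nonce trailer followed by a Random Port trailer would therefore parse into two entries, with the 4-byte nonce and the 2-byte predicted port as their values.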


10.6 UDP-Lite 

Some applications are tolerant of bit errors that may be introduced in the data they 
send and receive. Often, these types of applications wish to use UDP in order to 
avoid connection setup overhead or to use broadcast or multicast addressing, but 
UDP uses a checksum that covers either the entire payload or none of it (i.e., when 
no checksum is computed by the sender). A protocol called UDP-Lite or UDPLite 
[RFC3828] addresses this issue by modifying the conventional UDP protocol to 
provide partial checksums. Such checksums cover only a portion of the payload 
in each UDP datagram. UDP-Lite has its own IPv4 Protocol and IPv6 Next Header 
field value (136), so it effectively counts as a separate transport protocol. UDP-Lite 
modifies the UDP header by replacing the (redundant) Length field with a Checksum
Coverage field (see Figure 10-8).


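[Figure 10-8 layout: the 8-byte UDP-Lite header spans bits 0 through 31 of two
32-bit words. The first word holds the Source Port Number (2 bytes) and the
Destination Port Number (2 bytes); the second holds the Checksum Coverage
(2 bytes) and the Checksum (2 bytes).]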


Figure 10-8 UDP-Lite includes a Checksum Coverage field that gives the number of bytes (starting 
with the first byte of the UDP-Lite header) covered by the checksum. The minimum 
value is 0, indicating that the whole datagram is covered. Values 1 through 7 are invalid, 
as the header is always covered. UDP-Lite uses a different IPv4 protocol number (136) 
from UDP (17). IPv6 uses the same values in the Next Header field. 
The Checksum Coverage field in Figure 10-8 is the number of bytes (starting
from the first byte of the UDP-Lite header) covered by the checksum. Except for
the special value 0, the minimum value is 8, because the UDP-Lite header itself
is always required to be covered by the checksum. The value 0 indicates that the
entire payload is covered by the checksum, as with conventional UDP. There is a
slight issue with IPv6 jumbograms because of the limited space used to hold the
Checksum Coverage field. For such datagrams, the number of bytes covered can
be at most 64KB or the entire datagram (i.e., when the Checksum Coverage field
has value 0). Special socket API options are used for applications to specify the
use of UDP-Lite (IPPROTO_UDPLITE) and the amount of checksum coverage
requested (using the SOL_UDPLITE, UDPLITE_SEND_CSCOV, and
UDPLITE_RECV_CSCOV options to setsockopt).
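
A brief sketch of how these pieces fit together follows. The helper names are hypothetical. The numeric fallbacks (136, 10, and 11) are the values Linux uses and are assumed here for Python builds that do not export the constants; other systems may differ or lack UDP-Lite entirely. The rule rejecting a coverage larger than the datagram comes from [RFC3828], not from the text above.

```python
import socket

# Linux values, used only if this Python build does not define the names.
IPPROTO_UDPLITE = getattr(socket, "IPPROTO_UDPLITE", 136)
SOL_UDPLITE = 136
UDPLITE_SEND_CSCOV = getattr(socket, "UDPLITE_SEND_CSCOV", 10)
UDPLITE_RECV_CSCOV = getattr(socket, "UDPLITE_RECV_CSCOV", 11)


def covered_bytes(coverage, datagram_len):
    """Bytes protected by the checksum for a given Checksum Coverage value.

    datagram_len counts the 8-byte UDP-Lite header plus the payload.
    """
    if coverage == 0:
        return datagram_len          # 0 means the whole datagram is covered
    if coverage < 8:
        raise ValueError("values 1 through 7 are invalid: header must be covered")
    if coverage > datagram_len:
        raise ValueError("coverage exceeds datagram length")
    return coverage


def open_udplite_sender(coverage):
    """Create a UDP-Lite socket that requests the given send-side coverage."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, IPPROTO_UDPLITE)
    sock.setsockopt(SOL_UDPLITE, UDPLITE_SEND_CSCOV, coverage)
    return sock
```

An application tolerant of payload bit errors might call `open_udplite_sender(8)` so that only the header is checksummed; a receiver would set UDPLITE_RECV_CSCOV in the same way to state the minimum coverage it accepts.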

10.7 IP Fragmentation 

As we described in Chapter 3, link-layer framing normally imposes an upper limit
on the maximum size of a frame that can be transmitted. To keep the IP datagram
abstraction consistent and isolated from link-layer details, IP employs fragmentation
and reassembly. Whenever the IP layer receives an IP datagram to send, it
determines which local interface the datagram is to be sent over next (via a forwarding
table lookup; see Chapter 5) and what MTU is required. IP compares
the outgoing interface's MTU with the datagram size and performs fragmentation
if the datagram is too large. Fragmentation in IPv4 can take place at the original
sending host and at any intermediate routers along the end-to-end path. Note
that datagram fragments can themselves be fragmented. Fragmentation in IPv6 is
somewhat different because only the source is permitted to perform fragmentation.
We saw an example of IPv6 fragmentation in Chapter 5.

When an IP datagram is fragmented, it is not reassembled until it reaches its
final destination. Two reasons have been given for this, the second more compelling
than the first. First, not performing reassembly within the network alleviates
the forwarding software (or hardware) in routers from implementing this feature.
Second, it is possible for different fragments of the same datagram to follow different
paths to their common destination. If this happens, no single router along the
path would in general be capable of reassembling the original datagram because
it would see only a subset of the fragments. The first argument is not terribly convincing
at face value given the current performance levels of routers, but it is even
less convincing when one considers that most routers must ultimately be capable
of functioning as end hosts anyhow (e.g., when being managed or configured).
The second argument remains compelling.

10.7.1 Example: UDP/IPv4 Fragmentation 

An application using UDP may need to worry about the size of the resulting IP
datagram it creates if it wishes to avoid IP-layer fragmentation. In particular, if
the size of the resulting datagram exceeds the link's MTU, the IP datagram is split
across multiple IP packets, which can lead to performance issues because if any
fragment is lost, the entire datagram is lost. Figure 10-9 illustrates the situation when
a 3020-byte UDP/IPv4 datagram is split into multiple IPv4 packets.


[Figure 10-9 layout: the original IPv4 datagram (Total Length 20 + 8 + 2992 = 3020
bytes, UDP Length = 3000) is carried in three IPv4 packets.
First fragment: IPv4 header, UDP header, UDP data; Offset = 0, MF = 1, IP payload
1480 bytes, Total Length = 1500.
Second fragment: IPv4 header, UDP data; Offset = 185, MF = 1, Total Length = 1500.
Last fragment: IPv4 header, UDP data; Offset = 370, MF = 0, IP payload 40 bytes,
Total Length = 60.]

Figure 10-9 A single UDP datagram with 2992 UDP payload bytes is fragmented into three
UDP/IPv4 packets (no options). The UDP header that contains the source and destination
port numbers appears only in the first fragment (a complicating factor for firewalls and
NATs). Fragmentation is controlled by the Identification, Fragment Offset, and More
Fragments (MF) fields in the IPv4 header.


In Figure 10-9, we conclude that the original UDP datagram included 2992
bytes of application (UDP payload) data and 8 bytes of UDP header, resulting in
an IPv4 Total Length field value of 3020 bytes (recall that this size includes a 20-byte
IPv4 header as well). When this datagram was fragmented into three packets, 40
extra bytes were created (20 bytes for each of the newly created IPv4 fragment
headers). Thus, the total number of bytes sent is 3060, an increase in IP-layer overhead
of about 1.3%. The Identification field value (set by the original sender) is
copied to each fragment and is used to group them together when they arrive. The
Fragment Offset field gives the offset of the first byte of the fragment payload
in the original IPv4 datagram (in 8-byte units). Clearly, the first fragment always
has offset 0. Here, we observe the second fragment with offset 185 (185 × 8 = 1480).
The size of 1480 is the size of the first fragment less the size of the IPv4 header. A
similar analysis applies to the third fragment. Finally, the MF bit field indicates
whether more fragments in the datagram should be expected and is 0 only in the
final fragment. When the fragment with MF = 0 is received, the reassembly process
can ascertain the length of the original datagram, as a sum of the Fragment
Offset field value (times 8) and the IPv4 Total Length field value (minus the IPv4
header length). Because each Offset field is relative to the original datagram, the
reassembly process can handle fragments that arrive out of order. When a datagram
is fragmented, the Total Length field in the IPv4 header of each fragment is
changed to be the total size of that fragment.
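
The arithmetic in this example can be reproduced with a short sketch. The function names are hypothetical, and the code assumes option-free 20-byte IPv4 headers, as the figure does; it models only the Fragment Offset, MF, and Total Length fields.

```python
def fragment_ipv4(total_length, mtu, header_len=20):
    """Return a list of (fragment_offset_units, mf, total_length) tuples."""
    if total_length <= mtu:
        return [(0, 0, total_length)]            # fits: no fragmentation
    payload = total_length - header_len
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    max_data = (mtu - header_len) // 8 * 8
    fragments = []
    offset = 0
    while offset < payload:
        size = min(max_data, payload - offset)
        more = 1 if offset + size < payload else 0
        fragments.append((offset // 8, more, header_len + size))
        offset += size
    return fragments


def reassembled_payload_length(last_fragment, header_len=20):
    """Original IP payload length, derived from the fragment with MF = 0."""
    offset_units, _, total_length = last_fragment
    return offset_units * 8 + (total_length - header_len)
```

Calling `fragment_ipv4(3020, 1500)` yields the three fragments of Figure 10-9: offsets 0, 185, and 370, with Total Length values 1500, 1500, and 60. Their lengths sum to 3060 bytes, and the last fragment alone gives an original IP payload of 370 × 8 + 40 = 3000 bytes.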

Although IP fragmentation looks transparent, there is one feature mentioned
earlier that makes it less than desirable: if one fragment is lost, the entire datagram
is lost. To understand why this happens, realize that IP itself has no error
correction mechanism of its own. Mechanisms such as timeout and retransmission
are left as the responsibility of the higher layers. (TCP performs timeout and
retransmission; UDP does not. Some UDP-based applications perform timeout and
retransmission themselves, but this happens at a layer above UDP.) When a fragment
of a TCP segment is lost, TCP retransmits the entire TCP segment, which corresponds
to an entire IP datagram. There is no way to resend only one fragment of
a datagram. Indeed, if the fragmentation was done by an intermediate router, and
not the originating system, there is no way for the originating system to know how
the datagram was fragmented. For this reason, fragmentation is often avoided.
[KM87] provides arguments for avoiding fragmentation.

Using UDP, it is easy to generate IP fragmentation. (We shall see later that 
TCP tries to avoid fragmentation and that it is nearly impossible for an application 
to force TCP to send segments large enough to require fragmentation.) We can 
use our sock program and increase the size of the datagram until fragmentation 
occurs. On an Ethernet, the maximum amount of data in a frame is ordinarily 1500 
bytes (see Chapter 3), which leaves at most 1472 bytes for application data to avoid 
fragmentation, assuming 20 bytes for the IPv4 header and 8 bytes for the UDP 
header (see footnote 1). We will run our sock program with data sizes of 1471, 1472, 1473, and
1474 bytes. We expect the last two to cause fragmentation:

Linux% sock -u -i -n1 -w1471 10.0.0.3 discard
Linux% sock -u -i -n1 -w1472 10.0.0.3 discard
Linux% sock -u -i -n1 -w1473 10.0.0.3 discard
Linux% sock -u -i -n1 -w1474 10.0.0.3 discard


Listing 10-3 illustrates the tcpdump output (some lines are wrapped for 
clarity). 


Listing 10-3 UDP fragmentation on a 1500-byte MTU Ethernet link

1 23:42:43.562452 10.0.0.5.46530 > 10.0.0.3.9:
    udp 1471 (DF) (ttl 64, id 61350, len 1499)
2 23:42:50.267424 10.0.0.5.46531 > 10.0.0.3.9:
    udp 1472 (DF) (ttl 64, id 62020, len 1500)
3 23:42:57.814555 10.0.0.5 > 10.0.0.3:
    udp (frag 37671:1@1480) (ttl 64, len 21)
4 23:42:57.814715 10.0.0.5.46532 > 10.0.0.3.9:
    udp 1473 (frag 37671:1480@0+) (ttl 64, len 1500)
5 23:43:04.368677 10.0.0.5 > 10.0.0.3:
    udp (frag 37672:2@1480) (ttl 64, len 22)
6 23:43:04.368838 10.0.0.5.46535 > 10.0.0.3.9:
    udp 1474 (frag 37672:1480@0+) (ttl 64, len 1500)

1. Recall the assumption that no options are used. For IPv4 datagrams with options, the header
exceeds 20 bytes, up to a maximum of 60 bytes.


The first two UDP datagrams (packets 1 and 2) fit into 1500-byte Ethernet
frames (using the typical "DIX" or "Ethernet" encapsulation) and are not fragmented.
In the third case, the length of the IPv4 datagram corresponding to the
application write of 1473 bytes is 1501, which must be fragmented (packets 3 and
4). Similarly, the datagram generated by the write of 1474 bytes is 1502 bytes long
and is also fragmented (packets 5 and 6).

When it captures a fragmented datagram, tcpdump prints additional information.
First, the outputs frag 37671 (packets 3 and 4) and frag 37672 (packets
5 and 6) specify the value of the Identification field in the IPv4 header. The next
number in the fragmentation information (between the colon and the @ sign in
packets 4 and 6) is the IPv4 packet size, excluding the IPv4 header. The first fragment
of both datagrams contains 1480 bytes of data: 8 bytes for the UDP header
and 1472 bytes of user data. (The 20-byte option-free IPv4 header makes the packet
exactly 1500 bytes.) The second fragment of the first fragmented datagram (packet
3) contains 1 byte of data (the remaining byte of user data). The second fragment
of the second fragmented datagram (packet 5) contains the remaining 2 bytes of
user data. Fragmentation requires that the data portion of the generated fragments
(that is, everything excluding the IPv4 header) be a multiple of 8 bytes for all fragments
other than the final one. In this example, 1480 is a multiple of 8. (Contrast
this case with the IPv6 fragmentation example in Chapter 5, where the 1500-byte
Ethernet MTU was not able to be fully utilized.)

The number following the @ is the offset of the data in the fragment from the 
start of the datagram. The first fragment of each new fragmented datagram starts 
with offset 0 (packets 4 and 6), and the second fragment of both datagrams starts 
at byte offset 1480 (packets 3 and 5). The + sign following an offset value means 
that there are more fragments composing this datagram, corresponding to the MF 
bit field being set to 1 in the 3-bit Flags field in the IPv4 header. 
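
The fragment notation described above (Identification, a colon, the size, an @ sign, the byte offset, and an optional + sign) is regular enough to pull apart with a small helper. The sketch below is illustrative only: the function name is hypothetical, and the pattern targets the older tcpdump output style shown in this section, not every tcpdump version.

```python
import re

# Matches e.g. "frag 37671:1480@0+": Identification, size, byte offset, MF marker.
FRAG_RE = re.compile(r"frag (\d+):(\d+)@(\d+)(\+?)")


def parse_frag(line):
    """Extract fragment details from one tcpdump output line, or return None."""
    match = FRAG_RE.search(line)
    if match is None:
        return None
    ident, size, offset = (int(g) for g in match.group(1, 2, 3))
    return {
        "id": ident,                     # IPv4 Identification field
        "size": size,                    # bytes carried, excluding the IPv4 header
        "offset": offset,                # byte offset within the original datagram
        "more_fragments": match.group(4) == "+",   # MF bit
    }
```

Applied to a first-fragment line containing `frag 37671:1480@0+`, the helper reports size 1480, offset 0, and MF set; for a final fragment written as `frag 37671:1@1480`, it reports size 1, offset 1480, and MF clear.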

One observation that may be surprising is that the fragments with larger offsets
are delivered prior to the first fragments. In effect, the sender has intentionally
reordered the fragments. Upon reflection, we realize that this behavior can be
beneficial. If the last fragment is delivered first, the receiving host is able to ascertain
the maximum amount of buffer space it will require in order to reassemble
the entire datagram. Given that the reassembly process is robust to reordering
anyhow, this presents no major problem. On the other hand, there are techniques
that would like to take advantage of higher-layer information available in the first
fragment (including UDP port numbers) that is not present in the later fragments
[KEWG96].

Finally, note that packets 3 and 5 (fragments other than the first) omit the
source and destination UDP port numbers. In order for tcpdump to print the port
numbers associated with fragments other than the first, it would have to reassemble
fragmented datagrams to recover the port numbers that appear only in the
UDP header located in the first fragments (which it does not do).


10.7.2 Reassembly Timeout 

The IP layer must start a timer when any fragment of a datagram first arrives. If
this were not done, fragments that never arrive (as we see in Listing 10-4) could
eventually cause the receiver to run out of buffers and can constitute a form of
attack opportunity. The example in the listing was created with a special program
that constructs and sends only the first two fragments of an ICMPv4 Echo Request
message separated by a delay but then never sends any additional fragments.
Listing 10-4 illustrates the response (some lines have been wrapped for clarity).


Listing 10-4 Timeout during IPv4 fragment reassembly

1 17:35:59.609387 10.0.0.5 > 10.0.0.3:
    icmp: echo request (frag 28519:380@0+) (ttl 255, len 400)
2 17:36:19.617272 10.0.0.5 > 10.0.0.3:
    icmp (frag 28519:380@376+) (ttl 255, len 400)
3 17:36:29.602373 10.0.0.3 > 10.0.0.5:
    icmp: ip reassembly time exceeded for 10.0.0.5 > 10.0.0.3:
    icmp: echo request (frag 28519:380@0+) (ttl 255, len 400)
    [tos 0xc0] (ttl 64, id 38816, len 424)


Here we see that the first fragment (in both time and sequence space) is sent,
with total length 400. A second fragment is sent 20s later, but no final fragment
is ever sent. Thirty seconds after receiving the first fragment, the target machine
responds with an ICMPv4 Time Exceeded (code 1) message, telling the sender
that the datagram has been discarded by including a copy of the first fragment. A
normal timeout value is 30 or 60s. As we can see, the timer starts when any of the
fragments is received and is not reset when new fragments arrive. Thus, the timer
places a sort of bound on the maximum span of time by which fragments of the
same datagram can be separated.
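
The timer behavior just described (started by whichever fragment arrives first, never reset by later fragments) can be modeled in a few lines. The class and method names are hypothetical, timestamps are passed in explicitly to keep the sketch deterministic, and keying the table on (source, destination, protocol, Identification) is an assumption that follows the usual IPv4 reassembly convention rather than anything stated above.

```python
class ReassemblyTable:
    """Tracks when reassembly of each partially received datagram began."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.started = {}   # key -> arrival time of the first fragment seen

    def fragment_arrived(self, key, now):
        # setdefault keeps the original start time: later fragments
        # do not reset the timer.
        self.started.setdefault(key, now)

    def expired(self, now):
        """Discard and return the datagrams whose timer has run out."""
        gone = [k for k, t0 in self.started.items() if now - t0 >= self.timeout]
        for k in gone:
            del self.started[k]
        return gone
```

With a 30s timeout, a first fragment at time 0 and a second at time 20 still lead to expiry at time 30, matching the trace in Listing 10-4. A fuller model would also generate the ICMPv4 Time Exceeded message when the first fragment (offset 0) had been received.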


Note 

Historically, most Berkeley-UNIX-derived IP implementations simply never generated
this error. While these implementations did set a timer, and did discard all
fragments when the timer expired, the ICMP error was never generated. Another
detail one sometimes encounters is that an implementation is not required to
generate the ICMP error unless the first fragment has been received (i.e., the one
with the 0 Fragment Offset field). The reason is that the receiver of the ICMP error
cannot tell which user process sent the datagram that was discarded, because
the transport-layer header is not available. It is assumed that higher-layer protocols
will eventually time out and retransmit if necessary.

10.8 Path MTU Discovery with UDP

Let us examine the interaction between an application using UDP and the path
MTU discovery mechanism (PMTUD) [RFC1191]. For a protocol such as UDP, in
which the calling application is generally in control of the outgoing datagram size,
it is useful if there is some way to determine an appropriate datagram size if fragmentation
is to be avoided. Conventional PMTUD uses ICMP PTB messages (see
Chapter 8) in determining the largest packet size along a routing path that can
be used without inducing fragmentation. These messages are typically processed
below the UDP layer and are not directly visible to an application, so either an API
call is used for the application to learn the best current estimate of the path MTU
size for each destination with which it has communicated, or the IP layer can perform
PMTUD independently without the application knowing. The IP layer often
caches PMTUD information on a per-destination basis and times it out if it is not
refreshed.
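
As one illustration of such an API, Linux exposes the kernel's cached path MTU estimate through the IP_MTU socket option on a connected UDP socket. The sketch below is Linux-specific and hedged accordingly: the numeric fallbacks (10, 2, and 14) are the Linux values for IP_MTU_DISCOVER, IP_PMTUDISC_DO, and IP_MTU, assumed here because not every Python build exports these names, and the helper names are hypothetical.

```python
import socket

# Linux values, used only when this Python build lacks the named constants.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)
IP_MTU = getattr(socket, "IP_MTU", 14)


def max_udp_payload(path_mtu, ip_header=20, udp_header=8):
    """Largest UDP payload that avoids fragmentation (option-free IPv4 assumed)."""
    return path_mtu - ip_header - udp_header


def current_path_mtu(host, port):
    """Ask the Linux IP layer for its current path MTU estimate toward host."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Always set DF so that PTB messages refine the cached estimate.
        sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
        sock.connect((host, port))       # IP_MTU requires a connected socket
        return sock.getsockopt(socket.IPPROTO_IP, IP_MTU)
    finally:
        sock.close()
```

An application could combine the two, for example `max_udp_payload(current_path_mtu(host, 7))`, to size its writes. Until a PTB message has been received, the value returned is typically just the outgoing interface MTU.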

10.8.1 Example 

In the following example, we use the sock program to create a UDP datagram
that produces a 1501-byte IPv4 datagram. Both our host system and the attached
LAN support an MTU larger than 1500 bytes, but the outgoing link to the Internet
at the router does not. The command attempts to send three UDP messages to the
echo service (UDP port 7) in quick succession.

Linux% sock -u -i -n 3 -w1473 www.cs.berkeley.edu echo

Listing 10-6 illustrates the corresponding packet trace we can see using tcpdump
at the sender (some lines are wrapped for clarity).

Listing 10-6 tcpdump output illustrating ICMP PTB message. The suggested MTU is included.

1 14:42:18.359366 IP (tos 0x0, ttl 64, id 18331, offset 0, flags [DF],
    proto UDP (17), length 1501)
    12.46.129.28.33954 > 128.32.244.172.7: UDP, length 1473
2 14:42:18.359384 IP (tos 0x0, ttl 64, id 18332, offset 0, flags [DF],
    proto UDP (17), length 1501)
    12.46.129.28.33954 > 128.32.244.172.7: UDP, length 1473
3 14:42:18.359402 IP (tos 0x0, ttl 64, id 18333, offset 0, flags [DF],
    proto UDP (17), length 1501)
    12.46.129.28.33954 > 128.32.244.172.7: UDP, length 1473
4 14:42:18.360156 IP (tos 0x0, ttl 255, id 23457, offset 0,
    flags [none], proto ICMP (1), length 56)
    12.46.129.1 > 12.46.129.28: ICMP
    128.32.244.172 unreachable - need to frag (mtu 1500), length 36
    IP (tos 0x0, ttl 63, id 18331, offset 0, flags [DF],
    proto UDP (17), length 1501)
    12.46.129.28.33954 > 128.32.244.172.7: UDP, length 1473


In Listing 10-6 we see three UDP datagrams of 1473 UDP (application) payload
bytes each. Each produces a 1501-byte (unfragmented) IPv4 datagram. Each
of these datagrams has the IPv4 DF bit field turned on (the default on this system),
so when one of them reaches a router (IPv4 address 12.46.129.1), an ICMPv4 PTB
message is produced, which includes the suggested next-hop MTU of 1500 bytes.
We may also observe that the ICMPv4 messages produced contain the UDP/IPv4
headers (and first 8 data bytes) from our discarded ("offending") datagrams. In
this example, our sock program sent its datagrams so quickly (in under a millisecond)
that it completed its execution before any of the ICMP messages were
returned and processed.


Note 

The 1500-byte MTU is now a common minimum MTU among ISPs. Some ISPs
that incorporate PPPoE for address assignment and management use smaller,
1492-byte MTUs. The PPPoE header (see Chapter 3) comprises 6 bytes, and the
following PPP header is 2, leaving 1500 - 6 - 2 = 1492 bytes for the encapsulated
datagram.


If we use another destination host (one about which we have no path MTU
history), and we add additional delay between writes, we can observe different
behavior. Using the sock command with the -p 2 option, which adds 2s of delay
between each send, we use the following two (identical) commands:

Linux% sock -u -i -n 3 -w1473 -p 2 www.wisc.edu echo
write returned -1, expected 1473: Message too long
Linux% sock -u -i -n 3 -w1473 -p 2 www.wisc.edu echo

The tcpdump output, using an alternative version of tcpdump, for these commands
is given in Listing 10-7 (some lines are wrapped for clarity).


Listing 10-7 Illustration of successful Path MTU discovery on 3000-byte MTU link adapting to
1500-byte path MTU

1 17:22:16.331023 IP (tos 0x0, ttl 64, id 58648, offset 0, flags [DF],
    proto: UDP (17), length: 1501)
    12.46.129.28.33955 > 144.92.9.185.7: UDP, length 1473
2 17:22:16.331581 IP (tos 0x0, ttl 255, id 38518, offset 0,
    flags [none], proto: ICMP (1), length: 56)
    12.46.129.1 > 12.46.129.28: ICMP
    144.92.9.185 unreachable - need to frag (mtu 1500), length 36
    IP (tos 0x0, ttl 63, id 58648, offset 0, flags [DF],
    proto: UDP (17), length: 1501)
    12.46.129.28.33955 > 144.92.9.185.7: UDP, length 1473
3 17:22:24.284866 IP (tos 0x0, ttl 64, id 53776, offset 0, flags [+],
    proto: UDP (17), length: 1500)
    12.46.129.28.33955 > 144.92.9.185.7: UDP, length 1473
4 17:22:24.284873 IP (tos 0x0, ttl 64, id 53776, offset 1480,
    flags [none], proto: UDP (17), length: 21)
    12.46.129.28 > 144.92.9.185: udp
5 17:22:26.293554 IP (tos 0x0, ttl 64, id 53777, offset 0, flags [+],
    proto: UDP (17), length: 1500)
    12.46.129.28.33955 > 144.92.9.185.7: UDP, length 1473
6 17:22:26.293559 IP (tos 0x0, ttl 64, id 53777, offset 1480,
    flags [none], proto: UDP (17), length: 21)
    12.46.129.28 > 144.92.9.185: udp
7 17:22:28.301469 IP (tos 0x0, ttl 64, id 53778, offset 0, flags [+],
    proto: UDP (17), length: 1500)
    12.46.129.28.33955 > 144.92.9.185.7: UDP, length 1473
8 17:22:28.301474 IP (tos 0x0, ttl 64, id 53778, offset 1480,
    flags [none], proto: UDP (17), length: 21)
    12.46.129.28 > 144.92.9.185: udp


In Listing 10-7 we can see that the first time we ran our program it resulted
in an error due to the ICMPv4 PTB message. The extra time provided within and
between runs provides an opportunity for the PTB message to reach the sending
host and for the error condition to be delivered back to the sender for processing.
Interestingly, when we run the program a second time, the path MTU has been
discovered to be 1500 bytes and the system is able to send the program's three
datagrams using fragmentation (packets 3, 5, and 7 indicate the first fragments of
the three datagrams). After 15 minutes (not illustrated), the path MTU information
is considered stale, the datagram is sent unfragmented, another ICMPv4 PTB
message is returned, and the process repeats.


Note 

[RFC1191] recommends a PMTU value determined using PMTUD to be considered
stale after 10 minutes. Path MTU discovery can sometimes cause problems
because firewalls and filtering gateways may drop ICMP traffic indiscriminately,
which can harm the PMTU discovery algorithm. Because of this, it is possible to
disable PMTU discovery on a system-wide or finer-granularity basis. On Linux, the
file /proc/sys/net/ipv4/ip_no_pmtu_disc can have a 1 written to it to disable
the feature. On Windows, it involves editing the registry entry
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters\EnablePMTUDiscovery
to include the value 0. An alternative to conventional
PMTUD that does not use ICMP has also been developed [RFC4821]; we will
discuss it in the context of TCP in Chapter 15.


10.9 Interaction between IP Fragmentation and ARP/ND 

Using UDP, we can see the relationship between induced IP fragmentation and
typical implementations of ARP. Recall that ARP is used to map IP-layer addresses
to corresponding MAC-layer addresses on the same IPv4 subnet (see Chapter 4).
The questions with which we are concerned include: When multiple fragments are
to be sent, how many ARP messages should be generated, and how many of the
fragments are held until a pending ARP request/response is completed? (Similar
questions apply with IPv6 ND.) Returning to our host and LAN using a 1500-byte
MTU, we use the following two commands to see the answer:


Linux% sock -u -i -n1 -w8192 10.0.0.20 echo
Linux% sock -u -i -n1 -w8192 10.0.0.3 echo

These arguments cause our sock program to generate a single UDP datagram
with 8192 bytes of user data. We expect this to generate six fragments on an
Ethernet using a 1500-byte MTU size. We also make sure that the ARP cache is empty
before running the program, so that an ARP request and reply must be exchanged
before any fragments are sent (see Listing 10-8; some lines are wrapped for clarity).
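
The expectation of six fragments can be checked with a little arithmetic. The
following Python sketch is only an illustration: the function name and its
parameters are our own, and it assumes an optionless 20-byte IPv4 header and
that every fragment except the last carries as much data as the MTU permits
(rounded down to a multiple of 8 bytes).

```python
def ipv4_fragments(udp_payload_len, mtu=1500, ip_header_len=20, udp_header_len=8):
    """Return a list of (offset, data_len, more_fragments) tuples.

    Hypothetical helper for illustration; offsets are in bytes, not the
    8-byte units stored in the Fragment Offset field.
    """
    total = udp_payload_len + udp_header_len        # bytes carried above the IPv4 header
    per_fragment = (mtu - ip_header_len) // 8 * 8   # non-final fragments hold a multiple of 8
    fragments = []
    offset = 0
    while offset < total:
        data_len = min(per_fragment, total - offset)
        more = offset + data_len < total            # MF bit is set on all but the last fragment
        fragments.append((offset, data_len, more))
        offset += data_len
    return fragments


if __name__ == "__main__":
    for frag in ipv4_fragments(8192):
        print(frag)
```

With 8192 bytes of user data the UDP header brings the total to 8200 bytes,
which is split into five fragments of 1480 bytes and a final fragment of 800
bytes.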


Listing 10-8 ARP and fragmentation on Ethernet with 1500-byte MTU 

1 15:45:49.063561 arp who-has 10.0.0.20 tell 10.0.0.5 

2 15:45:50.059523 arp who-has 10.0.0.20 tell 10.0.0.5 

3 15:45:51.059505 arp who-has 10.0.0.20 tell 10.0.0.5 

4 15:46:08.555725 arp who-has 10.0.0.3 tell 10.0.0.5 

5 15:46:08.555973 arp reply 10.0.0.3 is-at 0:0:c0:c2:9b:26

6 15:46:08.555992 10.0.0.5 > 10.0.0.3:

udp (frag 27358:1480@2960+) (ttl 64, len 1500)

7 15:46:08.555998 10.0.0.5 > 10.0.0.3:

udp (frag 27358:1480@1480+) (ttl 64, len 1500)

8 15:46:08.556004 10.0.0.5.32808 > 10.0.0.3.7:

udp 8192 (frag 27358:1480@0+) (ttl 64, len 1500)


For this experiment, we happen to know that there is no running host assigned
address 10.0.0.20, so we should expect no reply. In the first part of Listing 10-8
(packets 1-3), we observe three ARP requests spaced approximately 1s apart. No
host responds after three requests are sent, so the ARP requestor gives up. In the
next case, an ARP response is received in about 250μs, and a fragment is sent
about 20μs thereafter. After this, the remaining fragments are sent very closely
together, within about 6μs of each other. Recall that in this system (Linux), the last
fragment is sent first.


Note 

Historically, the interaction between fragmentation and ARP has been problematic.
For example, in some cases an ARP request was sent for each fragment, and
in many cases only one of the fragments was queued pending the ARP response
(thus losing the datagram, as all but one of its fragments were discarded). The
first problem was addressed in [RFC1122], which requires an implementation to
prevent this type of ARP flooding. The recommended maximum rate is one per
second. The second problem is also discussed in [RFC1122], but this states only
that the link layer “SHOULD save (rather than discard) at least one (the latest)
packet of each set of packets destined to the same unresolved IP address, and
transmit the saved packet when the address has been resolved.” This approach
can lead to unnecessary packet loss and has been addressed in individual
implementations by providing a larger queue for packets while their ARP requests are
pending.


10.10 Maximum UDP Datagram Size 

Theoretically, the maximum size of an IPv4 datagram is 65,535 bytes, imposed by 
the 16-bit Total Length field in the IPv4 header (see Chapter 5). With an optionless 
IPv4 header of 20 bytes and a UDP header of 8 bytes, this leaves a maximum of 
65,507 bytes of user data in a UDP datagram. For IPv6, the 16-bit Payload Length 
field permits an effective UDP payload of 65,527 bytes (8 of the 65,535 IPv6 payload 
bytes are used for the UDP header), assuming jumbograms are not being used. 
There are two main reasons why a full-size datagram of these sizes may not be 
delivered end-to-end, however. First, the system's local protocol implementation 
may have some limitation. Second, the receiving application may not be prepared 
to handle such large datagrams. 
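
The two theoretical limits above follow directly from the header sizes. The
short Python sketch below simply restates that arithmetic; the function names
are ours, and it assumes an optionless IPv4 header and no IPv6 extension
headers or jumbograms.

```python
MAX_16_BIT = 2**16 - 1      # largest value of a 16-bit length field (65,535)
UDP_HEADER_LEN = 8


def max_udp_payload_ipv4(ip_header_len=20):
    """IPv4 Total Length covers the IPv4 header, the UDP header, and the data."""
    return MAX_16_BIT - ip_header_len - UDP_HEADER_LEN


def max_udp_payload_ipv6():
    """IPv6 Payload Length excludes the fixed 40-byte IPv6 header."""
    return MAX_16_BIT - UDP_HEADER_LEN


if __name__ == "__main__":
    print(max_udp_payload_ipv4(), max_udp_payload_ipv6())
```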

10.10.1 Implementation Limitations 

Protocol implementations provide an API to applications that pick some default 
buffer size for sending and receiving. Some implementations provide defaults that 
are less than the maximum IP datagram size, and some actually do not support 
sending datagrams larger than a few tens of kilobytes (although this problem is 
not common). 

The sockets API [UNP3] provides a set of functions that an application can 
call to set or query the size of the receive and send buffers. For a UDP socket, this 
size is directly related to the maximum size of UDP datagram the application can 
read or write. Typical default values are 8192 bytes or 65,535 bytes, but these can 
generally be made larger by invoking the setsockopt() API call.

We mentioned in Chapter 5 that a host is required to provide enough buffering
to receive at least a 576-byte IPv4 datagram on reassembly. Many UDP applications
are designed to restrict their application data size to 512 bytes or less (resulting in
IPv4 datagrams under 576 bytes), to stay below this limit. Examples employing
such limitations to their UDP datagram size include the DNS (see Chapter 11) and
DHCP (see Chapter 6).
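
The buffer-size queries and adjustments mentioned in this section can be
sketched with Python's socket module, which wraps the sockets API. This is a
minimal illustration, not a tuning recommendation: the helper name is ours,
and the values an operating system reports (or actually grants after a
request) vary from system to system.

```python
import socket


def udp_buffer_sizes(requested=None):
    """Return (receive, send) buffer sizes, in bytes, of a fresh UDP socket.

    If `requested` is given, ask for that size first; the system may round,
    scale, or cap the request, so the returned values can differ from it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        if requested is not None:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
            s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
        return (s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
                s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))


if __name__ == "__main__":
    print("defaults:", udp_buffer_sizes())
    print("after requesting 65535:", udp_buffer_sizes(65535))
```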

10.10.2 Datagram Truncation 

Just because UDP/IP is capable of sending and receiving a datagram of a given
(large) size does not mean the receiving application is prepared to read that size.
UDP programming interfaces allow the application to specify the maximum number
of bytes to return each time a network read operation completes. What happens
if the received datagram exceeds the size specified?

In most cases, the answer to this question is that the API truncates the datagram,
discarding any excess data in the datagram beyond the number of bytes
specified by the receiving application. However, the exact behavior varies from
implementation to implementation. Some systems provide the unconsumed portion
of the datagram in subsequent read operations, and others inform the caller of
how much data was truncated (or, in yet other cases, that some data was truncated,
but not exactly how much).
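
Truncation is easy to observe over the loopback interface. The following
Python sketch is an assumption-laden illustration: the helper name is ours, it
relies on the recvmsg() call (not available on every platform), and it assumes
a system that truncates and reports the fact through the MSG_TRUNC flag, as
Linux does. Systems with other behavior will produce different results.

```python
import socket


def receive_truncated(datagram, read_size):
    """Send one datagram over loopback and read at most read_size bytes of it.

    Returns (data, truncated), where truncated reflects the MSG_TRUNC flag
    reported by the receiving system.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as rx, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as tx:
        rx.bind(("127.0.0.1", 0))        # port 0: let the system pick a free port
        rx.settimeout(2.0)               # avoid blocking forever if nothing arrives
        tx.sendto(datagram, rx.getsockname())
        data, _ancdata, flags, _addr = rx.recvmsg(read_size)
        return data, bool(flags & socket.MSG_TRUNC)


if __name__ == "__main__":
    data, truncated = receive_truncated(b"x" * 100, 10)
    print(len(data), truncated)
```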


Note 

In Linux, the MSG_TRUNC option may be given to the sockets API to discover
how much data was truncated. On HP-UX, MSG_TRUNC is instead a flag set
when a read call returns that some data was truncated. The sockets API under
SVR4 (including Solaris 2.x) does not truncate the datagram. Any excess data is
returned in subsequent reads. The application is not notified that multiple reads
are being fulfilled from a single UDP datagram.


When we discuss TCP we shall see that it provides a continuous stream of 
bytes to the application, without any message boundaries. Thus, an application 
consumes however much data it requests, provided sufficient data is available (if 
not, it usually waits). 


10.11 UDP Server Design 

There are some characteristics of UDP that affect the design and implementation
of networking application software wishing to use it [RFC5405]. Servers typically
interact with the operating system, and most need a way to handle multiple
clients at the same time. Client design and implementation are usually simpler, and
therefore we will not discuss them here.

In the typical client/server scenario, a client starts, immediately communicates
with a single server, and is done. Servers, on the other hand, start and then
go to sleep, waiting for a client's request to arrive. They awaken when a client's
datagram arrives, which usually requires the server to evaluate the request and
possibly perform further processing. Our interest here is not in the programming
aspects of clients and servers ([UNP3] covers all those details) but in the protocol
features of UDP that affect the design and implementation of a server using UDP.
(We examine the details of TCP server design in Chapter 13.) Although some of
the features we describe depend slightly on the implementation of UDP being
used, the features are common to most implementations.

10.11.1 IP Addresses and UDP Port Numbers 

What arrives at a UDP server from a client is a UDP datagram. The IP header
contains the source and destination IP addresses, and the UDP header contains the
source and destination UDP port numbers. When an application receives a UDP
message, the IP and UDP headers have been stripped off; the application must be
told by the operating system in some other way who sent the message (the source
IP address and port number), if it intends to furnish a reply. This feature allows a
UDP server to handle multiple clients.
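
In the sockets API, the source address and port are typically handed to the
application alongside the data. The sketch below, written in Python, is a
hypothetical echo exchange over the loopback interface; the function names are
ours, and it shows only the idea that a reply is addressed using what the
operating system reported about the sender.

```python
import socket


def serve_one_echo(server_sock):
    """Receive one datagram and send it back to whoever sent it."""
    # recvfrom() returns the payload plus the sender's (IP address, port).
    data, (src_ip, src_port) = server_sock.recvfrom(65535)
    server_sock.sendto(data, (src_ip, src_port))
    return src_ip, src_port


def demo():
    """Run one client/server exchange on 127.0.0.1 and report what was seen."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        server.bind(("127.0.0.1", 0))    # port 0: let the system pick a free port
        server.settimeout(2.0)
        client.settimeout(2.0)
        client.sendto(b"hello", server.getsockname())
        seen = serve_one_echo(server)
        reply, _ = client.recvfrom(65535)
        return seen, client.getsockname(), reply


if __name__ == "__main__":
    print(demo())
```

Because each received datagram carries its own source information, the same
loop can answer any number of clients without per-client state.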

Some servers need to know to whom the datagram was sent, that is, the
destination IP address. While it may seem obvious that such information would
immediately be known by a server without looking into the received datagram, this
is not always the case. For example, because of multihoming, IP address aliasing,
and ordinary IPv6 usage with multiple scopes, a host may have multiple IP
addresses, and a single server may receive incoming datagrams using any of them
(this is in fact the common case). Any server wishing to perform its tasks
differently depending on the destination IP address selected by the client would require
access to the destination IP address information. In addition, some services may
respond differently if the destination address is broadcast or multicast (e.g.,
the Host Requirements RFC [RFC1122] states that a TFTP server should ignore
received datagrams that are sent to a broadcast address).


Note 

A DNS server is one type of server that is sensitive to the destination IP address. It
can use this information to arrange a particular sorting order on the address
mappings it returns. This behavior of DNS is described in more detail in Chapter 11.


The lesson here is that even though an API may deliver all the data contained
in a transport-layer datagram, additional information from the various layers
(typically addressing information) may be required for a server to operate most
effectively. This issue is not unique to UDP, of course, but because it is the first
transport-layer protocol we study, it is worthwhile to point out now.

UDP servers designed for use with both IPv4 and IPv6 must consider the fact
that these two types of addresses have significantly different lengths and require
different data structures. In addition, the interoperability mechanism of encoding
IPv4 addresses in IPv6 addresses may allow the use of IPv6 sockets to handle both
IPv4 and IPv6 addressing. See [UNP3] for more details.

10.11.2 Restricting Local IP Addresses

Most UDP servers wildcard their local IP address when they bind a UDP endpoint.
This means that an incoming UDP datagram destined for the server's port
is accepted on any local IP address (any IP address in use on the local machine,
including the local loopback address). For example, we can start an IPv4 UDP
server on port 7777:


Linux% sock -u -s 7777 


We can then use the netstat command to see the state of the endpoint (see
Listing 10-9).


Listing 10-9 netstat listing of IPv4 UDP servers using wildcarded address bindings

Linux% netstat -l --udp -n

Active Internet connections (only servers)

Proto Recv-Q Send-Q Local Address Foreign Address

udp 0 0 *:7777 0.0.0.0:*


We have deleted several lines of output other than the one in which we are
interested. The -l flag reports on all listening sockets (servers). The --udp flag
provides data relating only to the UDP protocol. The -n flag prints IP addresses
rather than fully expanded host names.


Note 

While not all systems provide exactly these (Linux) flags for netstat, most
provide the netstat command with some combination of flags to obtain similar
results. On BSD, the -l and -p udp flags are supported. On Windows, the -n,
-a, and -p udp flags can be used.


The local address is printed as *:7777, where the asterisk means that the local
IP address has been wildcarded. When the server creates its endpoint, it can
specify one of the host's local IP addresses, including a broadcast address, as the local
IP address for the endpoint. In such cases, incoming UDP datagrams are then
passed to this endpoint only if the destination IP address matches the specified
local address. With our sock program, if we specify an IP address before the
port number, that IP address becomes the local IP address for the endpoint. For
example, the command

Linux% sock -u -s 127.0.0.1 7777

restricts the server to accepting only datagrams arriving on the local loopback
interface (127.0.0.1), which can be generated only on the same host. The netstat
output in Listing 10-10 shows this case.

Listing 10-10 netstat listing of UDP IPv4 server bound to only the local loopback interface

Active Internet connections (only servers)

Proto Recv-Q Send-Q Local Address Foreign Address

udp 0 0 127.0.0.1:7777 0.0.0.0:*


If we try to send this server a datagram from a host on the same Ethernet,
an ICMPv4 Port Unreachable message is returned, and the sending application
receives an error. The server never sees the datagram.

Linux% sock -u -v 10.0.0.3 6666 

connected on 10.0.0.5.50997 to 10.0.0.3.6666 
123 

error: Connection refused 
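
The wildcard and restricted bindings described in this section correspond to
the address passed to bind() in the sockets API. The Python sketch below is a
minimal illustration; the helper name is ours, and it uses port 0 (a free port
chosen by the system) rather than the port 7777 used in the text so that it
does not collide with a running server.

```python
import socket


def bound_udp_socket(local_ip=""):
    """Return an IPv4 UDP socket bound to local_ip on a free port.

    An empty string requests the wildcard address (INADDR_ANY).
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind((local_ip, 0))
    return s


if __name__ == "__main__":
    wildcard = bound_udp_socket()               # accepts datagrams on any local address
    loopback = bound_udp_socket("127.0.0.1")    # accepts only datagrams sent to 127.0.0.1
    print(wildcard.getsockname(), loopback.getsockname())
    wildcard.close()
    loopback.close()
```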


10.11.3 Using Multiple Addresses 

It is possible to start different servers on the same port number, each with a
different local IP address. Normally, however, the system must be told by the
application that it is OK to reuse the same port number in this way.


Note 

With the sockets API, the SO_REUSEADDR socket option must be specified.
This is done in our sock program by specifying the -A option.


Even if we have only one true network interface, we can establish additional
IP addresses for it to use. Here, our host has a native IPv4 address of 10.0.0.30, but
we will give it two additional addresses:


Linux# ip addr add 10.0.2.13 scope host dev eth0
Linux# ip addr add 10.0.2.14 scope host dev eth0


Now our host has four unicast IPv4 addresses: its native address, the two
we have just added, plus its local loopback address. We can start three different
instances of a UDP server with our sock program, all on the same UDP
port (8888):


Linux% sock -u -s -A 10.0.2.13 8888
Linux% sock -u -s -A 10.0.2.14 8888
Linux% sock -u -s -A 8888

The servers must be started with the -A flag, telling the system that it is OK
to reuse the same addressing information. The netstat output in Listing 10-11
shows the addresses and port numbers on which the servers are listening.


Listing 10-11 Restricted and wildcarded UDP servers on the same UDP port

Active Internet connections (only servers)

Proto Recv-Q Send-Q Local Address   Foreign Address

udp   0      0      10.0.2.13:8888  0.0.0.0:*

udp   0      0      0.0.0.0:8888    0.0.0.0:*

udp   0      0      10.0.2.14:8888  0.0.0.0:*



In this scenario, the only IPv4 datagrams that will go to the server with the
wildcarded local address are those destined for 10.0.0.30, the directed broadcast
address (e.g., 10.255.255.255), the limited broadcast address (255.255.255.255), or
the local loopback address (127.0.0.1), because the restricted servers cover all other
possibilities.

There is a priority implied when an endpoint with a wildcard address exists.
An endpoint with a specific IP address that matches the destination IP address is
always chosen over a wildcard. The wildcard endpoint is used only when a
specific match is not found.
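
The priority rule can be observed without configuring extra interface
addresses by using loopback addresses, which on Linux cover all of 127.0.0.0/8.
The Python sketch below is an illustration under that assumption: the helper
names are ours, every socket sets SO_REUSEADDR (the option behind the -A flag),
and the delivery behavior shown is what we would expect on Linux; other
systems may differ.

```python
import select
import socket


def reuse_bound(local_ip, port):
    """Create a UDP socket with SO_REUSEADDR set, bound to (local_ip, port)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((local_ip, port))
    return s


def who_receives(destinations=("127.0.0.1", "127.0.0.2", "127.0.0.3")):
    """Map each destination address to the local address of the receiving server."""
    wildcard = reuse_bound("", 0)                 # wildcard server; system picks the port
    port = wildcard.getsockname()[1]
    servers = [wildcard,
               reuse_bound("127.0.0.1", port),    # two restricted servers, same port
               reuse_bound("127.0.0.2", port)]
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    result = {}
    try:
        for dest in destinations:
            client.sendto(b"probe", (dest, port))
            readable, _, _ = select.select(servers, [], [], 2.0)
            for s in readable:
                s.recvfrom(64)                    # drain the datagram
                result[dest] = s.getsockname()[0]
    finally:
        for s in servers + [client]:
            s.close()
    return result


if __name__ == "__main__":
    print(who_receives())
```

Datagrams sent to 127.0.0.1 and 127.0.0.2 should reach the servers bound to
those addresses, while a datagram sent to 127.0.0.3 has no specific match and
should fall through to the wildcard server.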

10.11.4 Restricting Foreign IP Address

In all the netstat output that we showed earlier, the foreign IP address (i.e.,
the one not local to the host where the server is running) and foreign port number
are shown as 0.0.0.0:*, meaning that the endpoint will accept an incoming
UDP datagram from any IPv4 address and any port number. However, there is an
option to restrict the foreign address. This means that the endpoint receives UDP
datagrams only from that specific IPv4 address and port number. Note that this
restriction can be added once a server has heard from a client, in order to filter
out additional traffic from other clients. Our sock program uses the -f option to
specify the foreign IPv4 address and port number:

Linux% sock -u -s -f 10.0.0.14.4444 5555 

This sets the foreign IPv4 address to 10.0.0.14 and the foreign port number 
to 4444. The server's port is 5555. If we run netstat, we see that the local address 
has also been set, even though we did not specify it explicitly (see Listing 10-12). 

Listing 10-12 Restricting the foreign address causes assignment of a local address. 

Linux% netstat --udp -n 

Active Internet connections (w/o servers) 

Proto Recv-Q Send-Q Local Address Foreign Address State 

udp 0 0 10.0.0.30:5555 10.0.0.14:4444 ESTABLISHED

This is a typical side effect of specifying the foreign IP address and foreign
port: if the local address has not been chosen when the foreign address is
specified, the local address is chosen automatically. Its value becomes the IP address of
the interface chosen by IP routing to reach the specified foreign IP address. Indeed,
in this example the primary IPv4 address for the Ethernet that is connected to the
foreign address is 10.0.0.30. Note that as a consequence of the endpoints being
determined and the foreign address restricted, the State column now indicates
that the association is ESTABLISHED.
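
In the sockets API, the foreign address of a UDP endpoint is commonly
restricted by calling connect() on the socket. The Python sketch below
illustrates the side effect just described; the helper name is ours, the
loopback address stands in for a real foreign host, and the exact addresses
reported may vary with the system and its routing table.

```python
import socket


def restrict_foreign(foreign_ip, foreign_port, local_port=0):
    """Bind a wildcard UDP socket, then restrict its foreign address.

    Returns (socket, local address before connect, local address after).
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", local_port))                 # local IP address is wildcarded
    before = s.getsockname()
    s.connect((foreign_ip, foreign_port))    # no packets are sent for UDP
    after = s.getsockname()                  # local IP address now chosen by routing
    return s, before, after


if __name__ == "__main__":
    sock, before, after = restrict_foreign("127.0.0.1", 4444)
    print(before, after, sock.getpeername())
    sock.close()
```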

Table 10-2 summarizes the three types of address bindings that a UDP server
can establish.


Table 10-2 Types of address bindings for a UDP server

Local Address         Foreign Address           Description

local_IP.local_port   foreign_IP.foreign_port   Restricted to one client

local_IP.local_port   *.* (wildcard)            Restricted to one local IP address
                                                and port (but for any client)

*.local_port          *.* (wildcard)            Restricted to local port only


In all cases, local_port is the server's port and local_IP must be one of
the locally assigned IP addresses. The ordering of the three rows in the table is the
order that the UDP module applies when trying to determine which local
endpoint receives an incoming datagram. The most specific binding (the first row) is
tried first, and the least specific (the last row with both IP addresses wildcarded)
is tried last.

10.11.5 Using Multiple Servers per Port 

Although it is not specified in the RFCs, by default most implementations allow 
only one application endpoint at a time to be associated with any one (local IP 
address, UDP port number) pair for a given address family (i.e., IPv4 or IPv6). When 
a UDP datagram arrives at a host destined for its IP address and an active port 
number, one copy is delivered to that single endpoint (e.g., a listening application). 
The IP address of the endpoint can be the wildcard, as shown earlier, but only a 
single application can receive datagrams for the address(es) specified. If we then 
try to start another server with the same wildcarded local address and the same 
port using the same address family, it does not work: 

Linux% sock -u -s 12.46.129.3 8888 & 

Linux% sock -u -s 12.46.129.3 8888 

can't bind local address: Address already in use

In support of multicasting (see Chapter 9), multiple endpoints can be allowed
to use the same (local IP address, UDP port number) pair, although the
application normally must tell the API that this is OK (i.e., our -A flag to specify the
SO_REUSEADDR socket option illustrated previously).


Note 

4.4BSD requires the application to set a different socket option (SO_REUSEPORT) 
to allow multiple endpoints to share the same port. Furthermore, each endpoint 
must specify this option, including the first one to use the port. 


When a UDP datagram arrives whose destination IP address is a broadcast or 
multicast address, and there are multiple endpoints at the destination IP address 
and port number, one copy of the incoming datagram is passed to each endpoint. 
(The endpoint's local IP address can be the wildcard, which matches any
destination IP address.) But if a UDP datagram arrives whose destination IP address is
a unicast address (i.e., an ordinary address), only a single copy of the datagram 
is delivered to one of the endpoints. Which endpoint gets the unicast datagram 
is implementation-dependent, but this policy helps to allow multithreaded and 
multiprocess servers to operate without being invoked multiple times on the same 
incoming request. 

10.11.6 Spanning Address Families: IPv4 and IPv6 

It is possible to write servers that span not only protocols (such as servers that 
respond to both TCP and UDP) but also across address families. That is, we may 
write a UDP server that responds to incoming requests for IPv4 as well as for IPv6. 
While this may seem entirely straightforward (IPv6 addresses are just additional 
IP addresses on the same host that happen to be 128 bits long), there is a subtlety 
related to the sharing of the port space. On some systems, the port space between 
IPv6 and IPv4 for UDP (and TCP) is shared. This means that if a service binds to a 
UDP port using IPv4, it is also allocated the same port in the IPv6 port space (and 
vice versa), preventing other services from using it (unless the SO_REUSEADDR 
socket option is used, as mentioned before). Furthermore, because IPv6 addresses
can encode IPv4 addresses in an interoperable way (see Chapter 2), wildcard
bindings in IPv6 may receive incoming IPv4 traffic.


Note 

The situation is implementation-specific. In Linux, all port space is shared, and
any wildcard IPv6 binding implies a corresponding IPv4 binding. In FreeBSD, the
IPV6_V6ONLY socket option may be used to ensure that bindings are present
only in the IPv6 space. Programmers should consult the socket interface for IPv6
for whichever operating environment they are supporting. C language bindings
are described in [RFC3493].
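
Where the IPV6_V6ONLY option is available, it is set like any other socket
option before the endpoint is bound. The Python sketch below is a minimal
illustration; the helper name is ours, and it assumes a host with IPv6 support
enabled (socket creation or binding fails otherwise).

```python
import socket


def ipv6_only_udp_socket(port=0):
    """Return a UDP socket bound to the IPv6 wildcard that refuses IPv4 traffic."""
    s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    # 1 keeps the binding in the IPv6 space only; 0 would permit
    # IPv4-mapped addresses on systems that support them.
    s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 1)
    s.bind(("::", port))
    return s


if __name__ == "__main__":
    try:
        sock = ipv6_only_udp_socket()
        print(sock.getsockname())
        sock.close()
    except OSError as exc:
        print("IPv6 not usable on this host:", exc)
```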

10.11.7 Lack of Flow and Congestion Control

Most UDP servers are iterative servers. This means that a single server thread (or 
process) handles all the client requests on a single UDP port (e.g., the server's 
well-known port). Normally there is a limited-size input queue associated with 
each UDP port that an application is using. This means that requests arriving at 
about the same time from different clients are automatically queued by UDP. The 
received UDP datagrams are passed to the application (when it asks for the next
one) in the order in which they were received (i.e., FCFS, or first come, first served).

It is possible, however, for this queue to overflow, causing the UDP
implementation to discard incoming datagrams. This can happen even if only one client is
being served because UDP provides no flow control (that is, no way for the server
to tell the client to slow down). Because UDP is a connectionless protocol with no
reliability mechanism of its own, applications are not told when the UDP input
queue overflows. The excess datagrams are just discarded by UDP.

Another concern arises from the fact that queues are also present in the IP
routers between the sender and the receiver, that is, in the middle of the network.
When these queues become full, traffic may be discarded in a fashion similar to that
of the UDP input queue. When this happens, the network is said to be congested.
Congestion is undesirable because it affects all network users with traffic that
traverses the point where congestion is occurring, as opposed to the UDP input case
mentioned previously, where only a single application server was affected. UDP
poses a special concern for congestion because it has no way of being informed
that it should slow down its sending rate if the network is being driven into
congestion. (It also has no mechanism for slowing down, even if it were told to do so.)
Thus, it is said to lack congestion control. Congestion control is a complex subject
and still an active area of research. We will return to considerations of congestion
control when we discuss TCP (see Chapter 16).


10.12 Translating UDP/IPv4 and UDP/IPv6 Datagrams 

In Chapter 7 we discussed a framework for translating IP datagrams from IPv4
to IPv6 and vice versa. Chapter 8 described how this framework applies to ICMP.
When UDP passes through a translator, the translation takes place as described
in Chapter 7, except there are issues specific to the UDP checksum. For UDP/IPv4
datagrams, the UDP header's Checksum field is allowed to be 0 (uncomputed),
whereas in UDP/IPv6 this is not allowed. Consequently, a complete datagram
arriving with a zero checksum and being translated from IPv4 to IPv6 results in either
a UDP/IPv6 datagram with a fully computed pseudo-header checksum being
generated, or the arriving packet being dropped. The translator is supposed
to provide a configuration option to select which is desired, as the overhead of
generating such checksums may be objectionable. Packets containing a nonzero
checksum being translated in either direction require the checksum to be updated
if a non-checksum-neutral address mapping is used (see Chapter 7).

Fragmented datagrams present another challenge. For stateless translators, a
fragmented UDP/IPv4 datagram with a zero checksum cannot be translated, as
the appropriate UDP/IPv6 checksum cannot be computed. Such datagrams are
dropped. Stateful translators (i.e., NAT64) can reassemble a number of fragments
and compute the required checksum. Fragmented UDP/IP datagrams with
computed checksums are handled as ordinary fragments in either direction, as
specified in Chapter 7. Large UDP/IPv4 datagrams that require fragmentation to fit
within the IPv6 minimum MTU after translation are also handled as conventional
IPv4 datagrams (i.e., they are fragmented as needed).
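
To make the cost of generating a checksum for a zero-checksum datagram
concrete, the following Python sketch computes a UDP checksum over the IPv6
pseudo-header, the UDP header, and the data. It is an illustration only: the
function names are ours, it is not taken from any translator implementation,
and it assumes no extension headers affect the pseudo-header.

```python
import ipaddress
import struct


def ones_complement_sum(data):
    """16-bit one's complement sum of data, padded with a zero byte if odd."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:                           # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def udp6_segment(src, dst, sport, dport, payload):
    """Return (pseudo_header, udp_segment) with the Checksum field filled in."""
    length = 8 + len(payload)
    pseudo = (ipaddress.IPv6Address(src).packed
              + ipaddress.IPv6Address(dst).packed
              + struct.pack("!I3xB", length, 17))    # length, 3 zero bytes, Next Header = UDP
    header = struct.pack("!HHHH", sport, dport, length, 0)
    checksum = ~ones_complement_sum(pseudo + header + payload) & 0xFFFF
    checksum = checksum or 0xFFFF                # a computed 0 is transmitted as all ones
    header = struct.pack("!HHHH", sport, dport, length, checksum)
    return pseudo, header + payload


def udp6_verify(pseudo, segment):
    """A receiver accepts the segment if the sum over everything is 0xFFFF."""
    return ones_complement_sum(pseudo + segment) == 0xFFFF


if __name__ == "__main__":
    pseudo, segment = udp6_segment("2001:db8::1", "2001:db8::2", 33955, 7, b"hello")
    print(segment.hex(), udp6_verify(pseudo, segment))
```

Every byte of the datagram must be visited, which is why a stateless
translator cannot perform this computation for a fragment in isolation.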


10.13 UDP in the Internet 

If we attempt to characterize the amount of UDP traffic in the Internet, we find
that useful, publicly available data is somewhat hard to come by, and that the
breakdown of traffic load by protocol varies from site to site. That said, studies
such as [FKMC03] find that UDP accounts for between 10% and 40% of Internet
traffic observed, and that as peer-to-peer applications gain in popularity, the use
of UDP is also on the rise [Z09], although TCP traffic still dominates in terms of
packets and bytes.

In [SMC02], fragmentation of Internet traffic is found to be most common with
UDP (68.3% of the fragmented traffic is UDP), although very little traffic overall
is fragmented (about 0.3% of packets, 0.8% of bytes). The authors report that the
most common type of traffic that is fragmented is UDP-based multimedia traffic
(53%; Microsoft's Media Player is responsible for about half of this) and
encapsulated/tunneled traffic such as that present in VPN tunnels (about 22%).
Furthermore, about 10% of the fragmentation is reverse-order (we saw this earlier in
the examples where the last IP fragment was sent prior to the first), and the most
commonly seen fragment size is 1500 bytes (79%), followed by 1484 bytes (18%)
and 1492 bytes (1%).


Note 

The 1500-byte MTU is related to the native usable payload size for Ethernet. The 
1484 size was produced by Digital Equipment Corporation’s GigaSwitch (now 
defunct), which represented significant portions of the topology measured at the 
time. 


The causes of fragmentation appear to derive from two factors: careless
encapsulation and lack of path MTU discovery and adaptation for applications that like
to use large messages. The former case relates to multiple levels of encapsulation
across many protocol layers that add additional headers, forcing IP packets that
initially fit into 1500-byte MTUs (the most common size) to no longer fit (e.g.,
application traffic carried over VPN tunnels). The second factor arises for applications
that use larger packets (e.g., video applications) that end up being fragmented. A
curious (and unfortunate) finding in the [SMC02] study is that numerous UDP
packets with the IPv4 DF bit field turned on (presumably trying to perform PMTU
discovery) are encapsulated in UDP packets that do not (thereby defeating the
attempt and leaving the responsible application ignorant of the fact).


10.14 Attacks Involving UDP and IP Fragmentation 

Most attacks involving UDP relate to exhaustion of some shared resource (buffers,
link capacity, etc.) or exploitation of bugs in protocol implementations causing
system crashes or other undesired behavior. Both fall into the broad category of DoS
attacks: the successful attacker is able to cause services to be made unavailable to
legitimate users. The most straightforward DoS attack with UDP is simply
generating massive amounts of traffic as fast as possible. Because UDP does not regulate
its sending traffic rate, this can negatively impact the performance of other
applications sharing the same network path. This can happen even without malicious
intent.

A more sophisticated form of DoS attack frequently associated with UDP is a
magnification attack. This type of attack generally involves an attacker sending a
small amount of traffic that induces other systems to generate much more. In the
so-called fraggle attack, a malicious UDP sender forges the IP source address to be
that of a victim and sets the destination address to a form of broadcast (e.g., the
directed broadcast address). UDP packets are sent to a service that generates
traffic in response to an incoming datagram. When the servers implementing these
services respond, they direct their messages to the IP address contained in the
Source IP Address field of the arriving UDP packet. In this case, the source address
is that of the victim, and so the victim host is subject to being overloaded by the
multiple UDP traffic responders. Variants of this magnification attack are
numerous, including inducing a character-generating service to be coupled to the echo
service, thereby causing traffic to be "ping-ponged" forever. This attack is closely
related to the ICMP smurf attack (see Chapter 8).

Several attacks involving IP fragmentation have appeared. IP fragmentation
processing is somewhat more complex than UDP processing, so it is not so
surprising that bugs in its implementation have been found and exploited. One
form of attack involves sending fragments that contain no data whatsoever. This
attack exploited a bug in IPv4 reassembly code and caused some systems to crash.
Another attack on the IPv4 reassembly layer is the teardrop attack, which involves
carefully constructing a series of fragments with overlapping Fragment Offset
fields that crash or otherwise badly affect some systems. A variant of this involves
overlapping fragment offsets that overwrite the UDP header from an earlier
fragment. Overlapping fragments are now prohibited with IPv6 [RFC5722]. Finally,
the also-related ping of death attack (typically constructed with ICMPv4 Echo
Request but also applicable to UDP) operates by creating an IPv4 datagram that on
reassembly exceeds the maximum limit. This is fairly straightforward because the
Fragment Offset field can be set to a value as high as 8191, which represents a byte
offset of 65,528 bytes. Any such fragment with length exceeding 7 bytes would, if
not prevented from doing so, result in a reconstructed datagram exceeding the
maximum size of 65,535 bytes. Mitigation techniques for some forms of fragment
attacks are given in [RFC3128].
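The ping of death arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation as the text states it: the function and constant names are our own, and, like the text, it counts only the fragment's offset and length rather than modeling a real reassembly implementation.

```python
FRAGMENT_OFFSET_UNIT = 8      # the Fragment Offset field counts 8-byte units
MAX_FRAGMENT_OFFSET = 8191    # largest value the 13-bit field can hold
MAX_DATAGRAM_SIZE = 65_535    # maximum size quoted in the text


def reassembled_end(fragment_offset, fragment_length):
    """Byte position just past the last byte this fragment would occupy."""
    return fragment_offset * FRAGMENT_OFFSET_UNIT + fragment_length


def overflows(fragment_offset, fragment_length):
    """True if reassembly would exceed the maximum datagram size."""
    return reassembled_end(fragment_offset, fragment_length) > MAX_DATAGRAM_SIZE


print(MAX_FRAGMENT_OFFSET * FRAGMENT_OFFSET_UNIT)   # 65528
print(overflows(MAX_FRAGMENT_OFFSET, 7))            # False
print(overflows(MAX_FRAGMENT_OFFSET, 8))            # True
```

A receiver that performs a check of this kind before copying fragment data can discard the offending fragment instead of overrunning its reassembly buffer.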


10.15 Summary 

UDP is a simple protocol. Its official specification, [RFC0768], requires only three
pages (including references!). The services it provides to a user process, above
and beyond IP, are port numbers and a checksum. It provides no flow control,
no congestion control, and no error correction. It does provide error detection
(optional for UDP/IPv4 but mandatory for UDP/IPv6) and preservation of
message boundaries. We used UDP to examine the Internet checksum and to see how
IP fragmentation is performed. We also looked at other aspects of UDP: how it is
used with path MTU discovery, how it impacts server design, and its presence in
the Internet.

UDP is most commonly used when the overhead of connection establishment
is to be avoided, when multipoint delivery (multicasting, broadcasting) is used,
or when the comparatively "heavyweight" reliability semantics of TCP (such as
sequencing, flow control, and retransmission) are not desired. It has enjoyed a
growing level of use because of multimedia and peer-to-peer applications and is
the primary protocol for supporting VoIP [RFC3550][RFC3261]. It is also a
convenient method for encapsulating traffic that must transition a NAT without
introducing much extra overhead (only 8 bytes for the UDP header). We have seen this
use for supporting an IPv6 transition mechanism (Teredo) and for aiding NAT
traversal with STUN (see Chapter 7), and we will see it again in Chapter 18 where
it is used for IPsec NAT traversal. One of UDP's other major uses is for supporting
the DNS. We explore this important application next, in Chapter 11.
Name Resolution and the Domain Name System (DNS)


11.1 Introduction 

The protocols we have studied so far operate using IP addresses to identify the
hosts that participate in a distributed application. These addresses (especially IPv6
addresses) are cumbersome for humans to use and remember, so the Internet
supports the use of host names to identify hosts, both clients and servers. In order to be
used by protocols such as TCP and IP, host names are converted into IP addresses
using a process known as name resolution. There are different forms of name
resolution in the Internet, but the most prevalent and important one uses a distributed
database system known as the Domain Name System (DNS) [MD88]. DNS runs as
an application on the Internet, using IPv4 or IPv6 (or both). For scalability, DNS
names are hierarchical, as are the servers that support name resolution.

DNS is a distributed client/server networked database that is used by TCP/IP
applications to map between host names and IP addresses (and vice versa), to
provide electronic mail routing information, service naming, and other capabilities.
We use the term distributed because no single site on the Internet knows all of the
information. Each site (university department, campus, company, or department
within a company, for example) maintains its own database of information and
runs a server program that other systems across the Internet (clients) can query.
The DNS provides the protocol that allows clients and servers to communicate
with each other and also a protocol for allowing servers to exchange information.

From an application's point of view, access to the DNS is through an
application library called a resolver. In general, an application must convert a host name
to an IPv4 and/or IPv6 address before it can ask TCP to open a connection or send
a unicast datagram using UDP. The TCP and IP protocol implementations know
nothing about the DNS; they operate only with the addresses.
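As an illustration of an application calling its resolver, the following Python sketch uses the standard library's socket.getaddrinfo function, which hands the name to the system's resolver and returns socket addresses. The wrapper function resolve and its default port are our own choices for this example, not part of any standard interface.

```python
import socket


def resolve(host, port=80):
    """Return the distinct IP addresses the system resolver reports for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        address = sockaddr[0]          # first element is the IP address string
        if address not in addresses:   # keep order, drop duplicates
            addresses.append(address)
    return addresses


# A numeric address needs no lookup, so this works without network access.
print(resolve("127.0.0.1"))
```

Only after a call such as this returns does the application hand an address to TCP or UDP, which matches the division of labor described above.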
In this chapter we will take a look at how the names in DNS are set up, how
resolvers and servers communicate using the Internet protocols (mainly UDP), and 
some of the other resolution mechanisms that are used in Internet environments. 
We do not cover all of the administrative details of running a name server or all of 
the options available with resolvers and servers. Such information is available from 
various other sources, including Albitz and Liu's DNS and BIND text [AL06] and in 
[RFC6168]. We discuss the details of DNS security (DNSSEC) in Chapter 18. 


11.2 The DNS Name Space 

The set of all names used with DNS constitutes the DNS name space. This space is 
partitioned hierarchically and is case insensitive, similar to computer file system 
folders (directories) and files. The current DNS name space is a tree of domains 
with an unnamed root at the top. The top echelons of the tree are the so-called 
top-level domains (TLDs), which include generic TLDs (gTLDs), country-code TLDs 
(ccTLDs), and internationalized country-code TLDs (IDN ccTLDs), plus a special 
infrastructure TLD called, for historical reasons, ARPA [RFC3172]. These form the
top levels of a naming tree with the form shown in Figure 11-1. 

There are five commonly used groups of TLDs, and one group of specialized
domains being used for internationalized domain names (IDNs).¹ The history of IDNs,
one piece of the "internationalization" or "i18n" of the Internet, is long and
somewhat complicated. Across the world, there are multiple languages, and each uses
one or more written scripts. While the Unicode standard [U11] aims to capture
the entire set of characters, many characters look the same but have different
Unicode values. Furthermore, characters written as text may flow from right to left, left
to right, or (when combining certain texts with others) in both directions. Couple
these (and other) somewhat technical concerns with concerns regarding equity
and international law and politics, and a considerable hurdle results. The interested
reader may wish to consult the IAB's review of IDNs [RFC4690], published in 2006,
for more information. Current information is available from [IIDN].

The gTLDs are grouped into categories: generic, generic-restricted, and sponsored.
The generic gTLDs (generic appears twice) are open for unrestricted use. The others
(generic-restricted and sponsored) are limited to various sorts of uses or are
constrained as to what entity may assign names from the domain. For example, EDU is
used for educational institutions, MIL and GOV are used for military and
government institutions of the United States, and INT is used for international
organizations (such as NATO). Table 11-1 provides a summary of the 22 gTLDs from [GTLD]
as of mid-2011. There is a "new gTLD" program in the works that may significantly
expand the current set, possibly to several hundred or even thousand. This
program and policies relating to TLD management in general are maintained by the
Internet Corporation for Assigned Names and Numbers (ICANN) [ICANN].


1. Figure 11-1 also shows 11 test IDN domains, which are still available. 
Figure 11-1 The DNS name space forms a hierarchy with an unnamed root at the top. The top-level domains (TLDs) include generic TLDs (gTLDs),
country-code TLDs (ccTLDs), internationalized TLDs (IDN ccTLDs), and a special infrastructure TLD called ARPA.

Table 11-1 The generic top-level domains (gTLDs), circa 2011

TLD     First Use (est.)    Use                                                           Example
AERO    December 21, 2001   Air-transport industry                                        www.sita.aero
ARPA    January 1, 1985     Infrastructure                                                18.in-addr.arpa
ASIA    May 2, 2007         Pan-Asia and Asia Pacific                                     www.seo.asia
BIZ     June 26, 2001       Business uses                                                 neustar.biz
CAT     December 19, 2005   Catalan linguistic/cultural community                         www.domini.cat
COM     January 1, 1985     Generic                                                       icanhascheezburger.com
COOP    December 15, 2001   Cooperative associations                                      www.ems.coop
EDU     January 1, 1985     Post-secondary educational institutions recognized by U.S.A.  hpu.edu
GOV     January 1, 1985     U.S. government                                               whitehouse.gov
INFO    June 25, 2001       Generic                                                       germany.info
INT     November 3, 1988    International treaty organizations                            nato.int
JOBS    September 8, 2005   Human resource managers                                       intel.jobs
MIL     January 1, 1985     U.S. military                                                 dtic.mil
MOBI    October 30, 2005    Customers/providers of mobile products/services               flowers.mobi
MUSEUM  October 30, 2001    Museums                                                       icom.museum
NAME    August 16, 2001     Individuals                                                   www.name
NET     January 1, 1985     Generic                                                       ja.net
ORG     December 9, 2002    Generic                                                       slashdot.org
PRO     May 6, 2002         Credentialed professionals/entities                           nic.pro
TEL     March 1, 2007       Contact data for businesses/individuals                       telnic.tel
TRAVEL  July 27, 2005       Travel industry                                               cancun.travel
XXX     April 15, 2011      Adult entertainment industry                                  whois.nic.xxx


The ccTLDs include the two-letter country codes specified by the ISO 3166
standard [ISO3166], plus five that are not: uk, su, ac, eu, and tp (the last one is
being phased out). Because some of these two-letter codes are suggestive of other
uses and meanings, various countries have been able to find commercial
windfalls from selling names within their ccTLDs. For example, the domain name
cnn.tv is really a registration in the Pacific island of Tuvalu, which has been selling
domain names associated with the television entertainment industry. Creating a
name in such an unconventional way is sometimes called a domain hack.

11.2.1 DNS Naming Syntax 

The names below a TLD in the DNS name tree are further partitioned into groups
known as subdomains. This is very common practice, especially for the ccTLDs. For
example, most educational sites in England use the suffix .ac.uk, whereas names
for most for-profit companies there end in the suffix .co.uk. In the United States,
city government Web sites tend to use the subdomain ci.city.state.us where state
is the two-letter abbreviation for the name of the state and city is the name of the
city. For example, www.ci.manhattan-beach.ca.us is the site of the city
government of Manhattan Beach, California, in the United States.

The example names we have seen so far are known as fully qualified domain
names (FQDNs). They are sometimes written more formally with a trailing period
(e.g., mit.edu.). This trailing period indicates that the name is complete; no
additional information should be added to the name when performing a name
resolution. In contrast to the FQDN, an unqualified domain name, which is used
in combination with a default domain or domain search list set during system
configuration, has one or more strings appended to the end. When a system is
configured (see Chapter 6), it is typically assigned a default domain extension and
search list using DHCP (or, less commonly, the RDNSS and DNSSL RA options).
For example, the default domain cs.berkeley.edu might be configured in
systems at the computer science department at UC Berkeley. If a user on one of these
machines types in the name vangogh, the local resolver software converts this
name to the FQDN vangogh.cs.berkeley.edu. before invoking a resolver to
determine vangogh's IP address.
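The conversion from an unqualified name to candidate FQDNs can be sketched as follows. This is a simplified illustration, not the algorithm of any particular resolver: the function name candidate_fqdns is ours, and trying the name as given after the search list is exhausted is an assumption made for the example (real resolvers apply additional rules about when to do so).

```python
def candidate_fqdns(name, search_list):
    """Return the FQDNs to try for name, in the order they would be tried."""
    if name.endswith("."):
        # A trailing period marks the name as complete; add nothing.
        return [name]
    candidates = [f"{name}.{domain.strip('.')}." for domain in search_list]
    # Assumption for this sketch: finally try the name exactly as typed.
    candidates.append(name + ".")
    return candidates


print(candidate_fqdns("vangogh", ["cs.berkeley.edu"]))
# ['vangogh.cs.berkeley.edu.', 'vangogh.']
print(candidate_fqdns("mit.edu.", ["cs.berkeley.edu"]))
# ['mit.edu.']
```

Each candidate would be passed to the resolver in turn until one of them resolves.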

A domain name consists of a sequence of labels separated by periods. The
name represents a location in the name hierarchy, where the period is the
hierarchy delimiter and descending down the tree takes place from right to left in the
name. For example, the FQDN


www.net.in.tum.de.


contains a host name label (www) in a four-level-deep domain (net.in.tum.de).
Starting from the root, and working from right to left in the name, the TLD is de
(the ccTLD for Germany), tum is shorthand for Technische Universität München,
in is shorthand for informatik (German for "computer science"), and finally net is
shorthand for the networks group within the computer science department. Labels
are case-insensitive for matching purposes, so the name ACME.COM is equivalent
to acme.com or AcMe.cOm [RFC4343]. Each label can be up to 63 characters long,
and an entire FQDN is limited to at most 255 (1-byte) characters. For example, this
domain name:


thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com 


was allegedly submitted as a potential world record for the longest name, with
a label of length 63, but was judged to have been of insufficient merit to justify a
place in the Guinness World Records.
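The length limits and the case-insensitive matching rule can be expressed in a short Python sketch. The helper names are our own, and the 255-character check simply counts the characters of the name as written, which is an assumption about how to apply the limit stated above.

```python
MAX_LABEL_LENGTH = 63    # per-label limit from the text
MAX_NAME_LENGTH = 255    # whole-name limit from the text


def split_labels(name):
    """Split a domain name into labels, ignoring one trailing period."""
    if name.endswith("."):
        name = name[:-1]
    return name.split(".")


def has_valid_lengths(name):
    """Check the per-label and whole-name length limits."""
    labels = split_labels(name)
    if any(len(label) == 0 or len(label) > MAX_LABEL_LENGTH for label in labels):
        return False
    return len(name) <= MAX_NAME_LENGTH


def same_name(a, b):
    """Compare two domain names label by label, ignoring case."""
    return [x.lower() for x in split_labels(a)] == [y.lower() for y in split_labels(b)]


longest = "thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com"
print(len(split_labels(longest)[0]))       # 63
print(has_valid_lengths(longest))          # True
print(same_name("ACME.COM", "AcMe.cOm"))   # True
```

Adding a single character to the first label of the example name would push it past the 63-character limit and cause has_valid_lengths to return False.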

The hierarchical structure of the DNS name space allows different
administrative authorities to manage different parts of the name space. For example, creating
a new DNS name of the form elevator.cs.berkeley.edu would likely require
dealing with the owner of the cs.berkeley.edu subdomain only. The berkeley.edu
and edu portions of the name space would not require alteration, so the
owners of those would not need to be bothered. This feature of DNS is one key
aspect of its scalability. That is, no single entity is required to administer all the
changes for the entire DNS name space. Indeed, creating a hierarchical structure
for names was one of the first responses in the Internet community to the
pressures of scaling and a major motivator for the structure used today. The
original Internet naming scheme was flat (i.e., no hierarchy), and a single entity was
responsible for assigning, maintaining, and distributing the list of nonconflicting
names. Over time, as more names were required and more changes were being
made, this approach became unworkable [MD88].


11.3 Name Servers and Zones 

Management responsibility for portions of the DNS name space is assigned to 
individuals or organizations. A person given responsibility for managing part of 
the active DNS name space (one or more domains) is supposed to arrange for at 
least two name servers or DNS servers to hold information about the name space 
so that users of the Internet can perform queries on the names. The collection of 
servers forms the DNS (service) itself, a distributed system whose primary job is 
to provide name-to-address mappings. However, it can also provide a wide array 
of additional information. 

The unit of administrative delegation, in the language of DNS servers, is 
called a zone. A zone is a subtree of the DNS name space that can be administered 
separately from other zones. Every domain name exists within some zone, even 
the TLDs that exist in the root zone. Whenever a new record is added to a zone, the 
DNS administrator for the zone allocates a name and additional information
(usually an IP address) for the new entry and enters these into the name server's
database. At a small campus, for example, one person could do this each time a new
server is added to the network, but in a large enterprise the responsibility would 
have to be delegated (probably by departments or other organizational units), as 
one person likely could not keep up with the work. 

A DNS server can contain information for more than one zone. At any
hierarchical change point in a domain name (i.e., wherever a period appears), a different
zone and containing server may be accessed to provide information for the name.
This is called a delegation. A common delegation approach uses a zone for
implementing a second-level domain name, such as berkeley.edu. In this domain,
there may be individual hosts (e.g., www.berkeley.edu) or other domains (e.g.,
cs.berkeley.edu). Each zone has a designated owner or responsible party who
is given authority to manage the names, addresses, and subordinate zones within
the zone. Often this person manages not only the contents of the zone but also the
name servers that contain the zone's database(s).
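To make the relationship between names and zones concrete, the sketch below finds the most specific zone that encloses a given name by comparing labels from right to left. The zone list is hypothetical and chosen only to mirror the examples in the text; whether these names are in fact separate zones depends on how their administrators have set up delegations.

```python
def to_labels(name):
    """Lowercase labels of a domain name; the root zone '.' has none."""
    return [label for label in name.lower().split(".") if label]


def enclosing_zone(name, zones):
    """Return the most specific zone in zones that contains name, or None."""
    target = to_labels(name)
    best, best_depth = None, -1
    for zone in zones:
        zone_labels = to_labels(zone)
        depth = len(zone_labels)
        # The zone encloses the name if its labels match the name's rightmost labels.
        if depth <= len(target) and target[len(target) - depth:] == zone_labels:
            if depth > best_depth:
                best, best_depth = zone, depth
    return best


ZONES = [".", "edu", "berkeley.edu", "cs.berkeley.edu"]  # hypothetical zone cuts
print(enclosing_zone("elevator.cs.berkeley.edu", ZONES))  # cs.berkeley.edu
print(enclosing_zone("www.berkeley.edu", ZONES))          # berkeley.edu
print(enclosing_zone("example.com", ZONES))               # .
```

Comparing whole labels rather than raw string suffixes matters: a name such as notberkeley.edu falls in the edu zone here, not in berkeley.edu.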
Zone information is supposed to exist in at least two places, implying that
there should be at least two servers containing information for each zone. This is
for redundancy; if one server is not functioning properly, at least one other server
is available. All of these servers contain identical information about a zone.
Typically, among the servers, a primary server contains the zone database in a disk
file, and one or more secondary servers obtain copies of the database in its entirety 
from the primary using a process called a zone transfer. DNS has a special protocol 
for performing zone transfers, but copies of a zone's contents can also be obtained 
using other means (e.g., the rsync utility [RSYNC]). 


11.4 Caching 

Name servers contain information such as name-to-IP-address mappings that 
may be obtained from three sources. The name server obtains the information 
directly from the zone database, as the result of a zone transfer (e.g., for a slave 
server), or from another server in the course of processing a resolution. In the first 
case, the server is said to contain authoritative information about the zone and may 
be called an authoritative server for the zone. Such servers are identified by name 
within the zone information. 

Most name servers (except some of the root and TLD servers) also cache zone 
information they learn, up to a time limit called the time to live (TTL). They use this 
cached information to answer queries. Doing so can greatly decrease the amount 
of DNS message traffic that would otherwise be carried on the Internet [J02]. 
When answering a query, a server indicates whether the information it is
returning has been derived from its cache or from its authoritative copy of the zone.
When cached information is returned, it is common for a server to also include the 
domain names of the name servers that can be contacted to retrieve authoritative 
information about the corresponding zone. 

As we shall see, each DNS record (e.g., name-to-IP-address mapping) has its 
own TTL that controls how long it can be cached. These values are set and altered 
by the zone administrator when necessary. The TTL dictates how long a mapping 
can be cached anywhere within DNS, so if a zone changes, there still may exist 
cached data within the network, potentially leading to incorrect DNS resolution 
behavior until expiry of the TTL. For this reason, some zone administrators,
anticipating a change to the zone contents, first reduce the TTL before implementing
the change. Doing so reduces the window for incorrect cached data to be present 
in the network. 

It is worth mentioning that caching is applied both for successful resolutions 
and for unsuccessful resolutions (called negative caching). If a request for a
particular domain name fails to return a record, this fact is also cached. Doing so can help
to reduce Internet traffic when errant applications repeatedly make requests for
names that do not exist. Negative caching was changed from optional to
mandatory by [RFC2308].
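The following Python sketch shows one way a cache could honor per-record TTLs and also remember failures. It is a toy model under our own assumptions: the class name, its methods, and the use of None to represent a negative entry are illustrative choices, and the addresses come from the documentation range rather than real records.

```python
import time


class DnsCache:
    """Toy TTL cache holding positive and negative results."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so expiry can be demonstrated
        self._entries = {}       # lowercased name -> (expires_at, address or None)

    def put(self, name, address, ttl):
        """Cache an address, or a negative result when address is None."""
        self._entries[name.lower()] = (self._clock() + ttl, address)

    def get(self, name):
        """Return (hit, address); a hit with address None is a cached failure."""
        key = name.lower()
        entry = self._entries.get(key)
        if entry is None:
            return (False, None)
        expires_at, address = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # TTL expired: forget the entry
            return (False, None)
        return (True, address)


now = [0.0]                                      # fake clock for the demonstration
cache = DnsCache(clock=lambda: now[0])
cache.put("EXAMPLE.COM", "192.0.2.1", ttl=300)   # positive entry
cache.put("nosuch.example", None, ttl=60)        # negative entry
print(cache.get("example.com"))                  # (True, '192.0.2.1')
print(cache.get("nosuch.example"))               # (True, None)
now[0] = 61.0
print(cache.get("nosuch.example"))               # (False, None)
```

Lowering a record's TTL ahead of a planned change, as described above, shortens the period during which caches like this one keep returning the old value.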
In some network configurations (e.g., those using older UNIX-compatible
systems), the cache is maintained in a nearby name server, not in the resolvers
resident in the clients. Placing the cache in the server allows any hosts on the LAN
that use the nearby server to benefit from the server's cache but implies a small
delay in accessing the cache over the local network. In Windows and more recent
systems (e.g., Linux), the client can maintain a cache, and it is made available to all
applications running on the same system. In Windows, this happens by default,
and in Linux, it is a service that can be enabled or disabled.

On Windows, the local system's cache parameters may be modified by editing
the following registry entry:


HKLM\SYSTEM\CurrentControlSet\Services\DNSCache\Parameters 


The DWORD value MaxNegativeCacheTtl gives the maximum number of
seconds that a negative DNS result remains in the resolver cache. The DWORD
value MaxCacheTtl gives the maximum number of seconds that a DNS record
may remain in the resolver cache. If this value is less than the TTL of a received
DNS record, the lesser value controls how long the record remains in cache. These
two registry keys do not exist by default, so they must be created in order to be
used.
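The "lesser value" rule amounts to taking a minimum. The small Python function below models only that rule as the text describes it; it does not read the registry, and its name and the use of None for an absent setting are our own conventions.

```python
def effective_cache_seconds(record_ttl, max_cache_ttl=None):
    """How long a record stays cached, given an optional configured cap."""
    if max_cache_ttl is None:
        # The setting is absent, so only the record's own TTL applies.
        return record_ttl
    return min(record_ttl, max_cache_ttl)


print(effective_cache_seconds(86400, max_cache_ttl=3600))   # 3600
print(effective_cache_seconds(300, max_cache_ttl=3600))     # 300
print(effective_cache_seconds(86400))                       # 86400
```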

In Linux and other systems that support it, the Name Service Caching Daemon
(NSCD) provides a client-side caching capability. It is controlled by the
/etc/nscd.conf file that can indicate which types of resolutions (for DNS and
some other services) are cached, along with some cache parameters such as TTL
settings. In addition, the file /etc/nsswitch.conf controls how name
resolution for applications takes place. Among other things, it can control whether local
files, the DNS protocol (see Section 11.5), and/or NSCD is employed for mappings.


11.5 The DNS Protocol 

The DNS protocol consists of two main parts: a query/response protocol used for
performing queries against the DNS for particular names, and another protocol
for name servers to exchange database records (zone transfers). It also has a way
to notify secondary servers that the zone database has evolved and a zone transfer
is necessary (DNS Notify), and a way to dynamically update the zone (dynamic
updates). By far, the most typical usage is a simple query/response to look up the
IPv4 address that corresponds to a domain name.

Most often, DNS name resolution is the process of mapping a domain name
to an IPv4 address, although IPv6 address mappings work in essentially the
same way. DNS query/response operations are supported over the distributed
DNS infrastructure consisting of servers deployed locally at each site or ISP, and a
special set of root servers. There is also a special set of generic top-level domain servers
used for scaling some of the larger gTLDs, including COM and NET. As of
mid-2011, there are 13 root servers named by the letters A through M (see [ROOTS] for
more information about them); 9 of them have IPv6 addresses. There are also 13
gTLD servers, also labeled A through M; 2 of them have IPv6 addresses. By
contacting a root server and possibly a gTLD server, the name server for any TLD in
the Internet can be discovered. These servers are mutually coordinated to provide
the same information. Some of them are not a single physical server but instead
a group of servers (over 50 for the F root server) that use the same IP address (i.e.,
using IP anycast addressing; see Chapter 2).

A full resolution that is unable to benefit from preexisting cached entries takes
place among several entities, as shown in Figure 11-2.


Figure 11-2 A typical recursive DNS query for EXAMPLE.COM from A.HOME involves up to ten messages. The
local recursive server (GW.HOME here) uses a DNS server provided by its ISP. That server, in turn,
uses an Internet root name server and a gTLD server (for COM and NET TLDs) to find the name
server for the EXAMPLE.COM domain. That name server (A.IANA-SERVERS.NET here) provides
the required IP address for the host EXAMPLE.COM. All of the recursive servers cache any
information learned for later use.


Here, we have a laptop called A.HOME residing nearby the DNS server 
GW.HOME. The domain HOME is private, so it is not known to the Internet—only 
locally at the user's residence. When a user on A.HOME wishes to connect to the 
host EXAMPLE.COM (e.g., because a Web browser has been instructed to access 
the page http://EXAMPLE.COM), A.HOME must determine the IP address for the 
server EXAMPLE.COM. Assuming it does not know this address already (it might
if it has accessed the host recently), the resolver software on A.HOME first makes 
a request to its local name server, GW.HOME. This is a request to convert the name 
EXAMPLE.COM into an address and constitutes message 1 (labeled on an arrow in 
Figure 11-2). 
Note

If the A.HOME system is configured with a default domain search list, there may
be additional queries. For example, if .home is a default search domain used by
A.HOME, the first DNS query may be for the name example.com.home, which
will fail at the gw.home name server, which is authoritative for .home. A
subsequent query will typically remove the default extension, resulting in a query for
EXAMPLE.COM.


If GW.HOME does not already know the IP address for EXAMPLE.COM or the 
name servers for either the EXAMPLE.COM domain or the COM TLD, it forwards the 
request to another DNS server (called recursion). In this case, a request (message 
2) goes to an ISP-provided DNS server. Assuming that this server also does not 
know the required address or other information, it contacts one of the root name 
servers (message 3). The root servers are not recursive, so they do not process 
the request further but instead return the information required to contact a name 
server for the COM TLD. For example, it might return the name
A.GTLD-SERVERS.NET and one or more of its IP addresses (message 4). With this information, the
ISP-provided server contacts the gTLD server (message 5) and discovers the name
and IP addresses of the name servers for the domain EXAMPLE.COM (message 6).
In this case, one of the servers is A.IANA-SERVERS.NET.

Given the correct server for the domain, the ISP-provided server contacts the 
appropriate server (message 7), which responds with the requested IP address 
(message 8). At this point, the ISP-provided server can respond to GW.HOME with 
the required information (message 9). GW.HOME is now able to complete the initial 
query and responds to the client with the desired IPv4 and/or IPv6 address(es) 
(message 10). 

From the perspective of A.HOME, the local name server was able to
perform the request. However, what really happened is a recursive query, where the
GW.HOME and ISP-provided servers in turn made additional DNS requests to
satisfy A.HOME's query. In general, most name servers perform recursive queries such
as this. The notable exceptions are the root servers and other TLD servers that do 
not perform recursive queries. These servers are a relatively precious resource, so 
encumbering them with recursive queries for every machine that performs a DNS 
query would lead to poor global Internet performance. 
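The work done by the ISP-provided server in messages 3 through 8 can be imitated with a small simulation. Everything in the table below is toy data: the server names follow the example in the text, but the referral structure is simplified and the address is a documentation-range placeholder, not the real record for EXAMPLE.COM.

```python
# Toy stand-ins for the root, gTLD, and authoritative servers in the example.
SERVERS = {
    "root": {"referrals": {"com": "A.GTLD-SERVERS.NET"}},
    "A.GTLD-SERVERS.NET": {"referrals": {"example.com": "A.IANA-SERVERS.NET"}},
    "A.IANA-SERVERS.NET": {"answers": {"example.com": "192.0.2.10"}},
}


def query(server, name):
    """One query/response exchange with a non-recursive server."""
    data = SERVERS[server]
    if name in data.get("answers", {}):
        return ("answer", data["answers"][name])
    labels = name.split(".")
    for i in range(len(labels)):            # try the longest suffix first
        suffix = ".".join(labels[i:])
        if suffix in data.get("referrals", {}):
            return ("referral", data["referrals"][suffix])
    return ("no-data", None)


def resolve_from_root(name, max_steps=10):
    """Follow referrals from the root until an answer is found."""
    name = name.lower().rstrip(".")
    server, contacted = "root", []
    for _ in range(max_steps):
        contacted.append(server)
        kind, value = query(server, name)
        if kind == "answer":
            return value, contacted
        if kind != "referral":
            return None, contacted
        server = value                       # ask the server we were referred to
    return None, contacted


print(resolve_from_root("EXAMPLE.COM"))
# ('192.0.2.10', ['root', 'A.GTLD-SERVERS.NET', 'A.IANA-SERVERS.NET'])
```

Each server in the returned list corresponds to one request/response pair in Figure 11-2 (messages 3 and 4, 5 and 6, and 7 and 8). A recursive server that cached the referrals would skip the earlier steps on later lookups.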

11.5.1 DNS Message Format 

There is one basic DNS message format [RFC6195]. It is used for all DNS
operations (queries, responses, zone transfers, notifications, and dynamic updates), as
illustrated in Figure 11-3. 

The basic DNS message begins with a fixed 12-byte header followed by four
variable-length sections: questions (or queries), answers, authority records, and
additional records. All but the first section contain one or more resource records
(RRs), which we discuss in detail in Section 11.5.6. (The question section contains
a data item that is very close in structure to an RR.) RRs can be cached; questions
are not.


 0                       8                    15
+-----------------------------------------------+
| Transaction ID (16 bits)                      |
+--+-----------+--+--+--+--+--+--+--+-----------+
|QR|  OpCode   |AA|TC|RD|RA| Z|AD|CD|   RCODE   |
|  | (4 bits)  |  |  |  |  |  |  |  | (4 bits)  |
+--+-----------+--+--+--+--+--+--+--+-----------+
| QDCOUNT/ZOCOUNT                               |
| (Query Count/Zone Count)                      |
+-----------------------------------------------+
| ANCOUNT/PRCOUNT                               |
| (Answer Count/Prerequisite Count)             |
+-----------------------------------------------+
| NSCOUNT/UPCOUNT                               |
| (Authority Record Count/Update Count)         |
+-----------------------------------------------+
| ARCOUNT/ADCOUNT                               |
| (Additional Information Count)                |
+-----------------------------------------------+
| Sections: Question, Answer,                   |
| Authority, Additional Information             |
|                                               |
| Sections (Used with DNS UPDATE):              |
| Zone, Prerequisite, Update,                   |
| Additional Information                        |
|                                               |
| (variable length)                             |
+-----------------------------------------------+

Flags:
  QR: Query(0)/Response(1)
  AA: Authoritative Answer
  TC: Truncated Answer
  RD: Recursion Desired
  RA: Recursion Available
  Z:  Zero
  AD: Authentic Data [RFC4035]
  CD: Checking Disabled [RFC4035]

Opcodes (common values):
  Query (0)  - Regular Query
  Notify (4) - DNS NOTIFY [RFC1996]
  Update (5) - DNS UPDATE [RFC2136]

RCODEs (common values):
  NoError (0)  - No Error
  FormErr (1)  - Format Error
  ServFail (2) - Server Failure
  NXDomain (3) - Non-existent Domain
  NotImp (4)   - Not Implemented
  Refused (5)  - Query Refused


Figure 11-3 The DNS message format has a fixed 12-byte header. The entire message is usually
carried in a UDP/IPv4 datagram and limited to 512 bytes. DNS UPDATE (DNS with
dynamic updates) uses the field names ZOCOUNT, PRCOUNT, UPCOUNT, and
ADCOUNT. A special extension format (called EDNS0) allows messages to be larger
than 512 bytes, which is required for DNSSEC (see Chapter 18).

In the fixed-length header, the Transaction ID field is set by the client and
returned by the server. It lets the client match responses to requests. The second
16-bit word includes a number of flags and other subfields. Beginning from the
left-most bit, QR is a 1-bit field: 0 means the message is a query; 1 means it is a
response. The next is the OpCode, a 4-bit field. The normal value is 0 (a standard
query) for requests and responses. Other values are: 4 (notify), and 5 (update).
Other values (1-3) are deprecated or never seen in operational use. Next is the AA
bit field that indicates an "authoritative answer" (as opposed to a cached answer).
TC is a 1-bit field that means "truncated." With UDP, this flag being set means that
the total size of the reply exceeded 512 bytes, and only the first 512 bytes of the
reply were returned.

RD is a bit field that means "recursion desired." It can be set in a query and
is then returned in the response. It tells the server to perform a recursive query.
If the bit is not set, and the requested name server does not have an authoritative
answer, the requested name server returns a list of other name servers to contact
for the answer. At this point, the overall query may be continued by contacting
the list of other name servers. This is called an iterative query. RA is a bit field that
means "recursion available." This bit is set in the response if the server supports
recursion. Root servers generally do not support recursion, thereby forcing clients
to perform iterative queries to complete name resolution. The Z bit field must be 0
for now but is reserved for future use.

The AD bit field is set to true if the contained information is authenticated, 
and the CD bit is set to true if security checking is disabled (see Chapter 18). The 
Response Code (or RCODE) field is a 4-bit field with the return code whose possible 
values are given in [DNSPARAM]. The common values include 0 (no error) and 
3 (name error or "nonexistent domain," written as NXDOMAIN). A list of the 
first 11 error codes is given in Table 11-2 (values 11 through 15 are unassigned). 
Additional types are defined using a special extension (see Section 11.5.2). A 
name error is returned only from an authoritative name server and means that the 
domain name specified in the query does not exist. 


Table 11-2 The first 11 error types used with the RCODE field

Value  Name      Reference  Description and Purpose
0      NoError   [RFC1035]  No error
1      FormErr   [RFC1035]  Format error; query cannot be interpreted
2      ServFail  [RFC1035]  Server failure; error in processing at server
3      NXDomain  [RFC1035]  Nonexistent domain; unknown domain referenced
4      NotImp    [RFC1035]  Not implemented; request not supported in server
5      Refused   [RFC1035]  Refused; server unwilling to provide answer
6      YXDomain  [RFC2136]  Name exists but should not (used with updates)
7      YXRRSet   [RFC2136]  RRSet exists but should not (used with updates)
8      NXRRSet   [RFC2136]  RRSet does not exist but should (used with updates)
9      NotAuth   [RFC2136]  Server not authorized for zone (used with updates)
10     NotZone   [RFC2136]  Name not contained in zone (used with updates)


The next four fields are 16 bits in size and specify the number of entries in the 
question, answer, authority, and additional information sections that complete the 
DNS message. For a query, the number of questions is normally 1 and the other 
three counts are 0. For a reply, the number of answers is at least 1. Questions have 
a name, type, and class. (Class supports non-Internet records, but we ignore this 
for our purposes. The type identifies the type of object being looked up.) All of 
the other sections contain zero or more RRs. RRs contain a name, type, and class 
information, but also the TTL value that controls how long the data can be cached. 
We shall discuss the most important RR types in detail once we have a look at how 
DNS encodes names and selects which transport protocol to use when carrying 
DNS messages. 






Section 11.5 The DNS Protocol 




11.5.1.1 Names and Labels 

The variable-length sections at the end of a DNS message contain a collection of 
questions, answers, authority information (names of name servers that contain 
authoritative information for certain data), and additional information that may 
be useful to reduce the number of necessary queries. Each question and each RR 
begins with a name (called the domain name or owning name) to which it refers. 
Each name consists of a sequence of labels. There are two categories of label types: 
data labels and compression labels. Data labels contain characters that constitute a 
label; compression labels act as pointers to other labels. Compression labels help to 
save space in a DNS message when multiple copies of the same string of characters 
are present across multiple labels. 

11.5.1.2 Data Labels 

Each data label begins with a 1-byte count that specifies the number of bytes that 
immediately follow. The name is terminated with a byte containing the value 0, 
which is a label with a length of 0 (the label of the root). For example, the encoding 
of the name www.pearson.com would be as shown in Figure 11-4. 

Figure 11-4 DNS names are encoded as a sequence of labels. This example encodes the name 
www.pearson.com, which (technically) has four labels. The end of the name is identified by 
a 0-length label of the nameless root. 


For data labels, each label Length byte must be in the range of 0 to 63, as labels 
are limited to 63 bytes. No padding is used for labels, so the total name length 
could be odd. Although these labels are sometimes called "text" labels, they are 
capable of containing non-ASCII values. This use, however, is uncommon and 
not recommended. Indeed, even the internationalized domain names, which can 
encode Unicode characters [RFC5890][RFC5891], use a curious encoding syntax 
called "punycode" [RFC3492] that expresses Unicode characters using the ASCII 
character set. To be completely safe, it is recommended to follow the requirements 
in [RFC1035], which suggest that labels "start with a letter, end with a letter or 
digit, and have as interior characters only letters, digits and hyphen." 
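
The data label encoding is simple enough to sketch in a few lines of Python. The helper name `encode_name` is hypothetical, and the sketch assumes ASCII-only labels; it is meant only to make the count-byte-then-characters layout concrete.

```python
def encode_name(name: str) -> bytes:
    """Encode a domain name as a sequence of DNS data labels.

    Each label is written as a 1-byte length followed by that many
    bytes; a 0 byte (the root label) terminates the name.
    """
    if name in ("", "."):
        return b"\x00"  # the root name is just the 0-length label
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 1 <= len(raw) <= 63:
            raise ValueError(f"label length out of range: {label!r}")
        out.append(len(raw))  # length byte, 1..63 for a data label
        out += raw            # label contents, no padding
    out.append(0)             # terminating root label
    return bytes(out)
```

Under these assumptions, `encode_name("www.pearson.com")` produces the 17 bytes `03 'www' 07 'pearson' 03 'com' 00`: three data labels followed by the 0-length root label.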

11.5.1.3 Compression Labels 

In many cases, a DNS response carries information in the answer, authority, and 
additional information sections relating to the same domain name. If data labels 
were used, the same characters would be repeated in the DNS message when 
referring to the same name. To avoid this redundancy and save space, a compression 
scheme is used. Anywhere the label portion of a domain name can occur, the 
single preceding count byte (which is normally between 0 and 63) instead has its 
2 high-order bits turned on, and the remaining bits are combined with the bits in 
the subsequent byte to form a 14-bit pointer (offset) in the DNS message. The offset 
gives the number of bytes from the beginning of the DNS message where a data 
label (called the compression target) is to be found that should be substituted for the 
compression label. Compression labels are thus able to point to a location up to 
16,383 bytes from the beginning. Figure 11-5 illustrates how we might encode the 
domain names usc.edu and ucla.edu using compression labels. 

Figure 11-5 A compression label can reference other labels to save space. This is accomplished by 
setting the 2 high-order bits of the byte preceding the label contents. This signals that 
the following 14 bits are used in providing an offset for the replacement label. In this 
example, usc.edu and ucla.edu share the edu label. 


In Figure 11-5 we see how the common label edu can be shared by the two 
domain names. Assuming the names start at offset 0, data labels are used to 
encode usc.edu as described previously. The next name is ucla.edu, and the 
label ucla is encoded using a data label. However, the label edu may be reused 
from the encoding of usc.edu. This is accomplished by setting the 2 high-order 
bits of the label Type byte to 1 and encoding the offset of edu in the remaining 14 
bits. Because the first occurrence of edu is at offset 4, we only need to set the first 
byte to 192 (6 bits of 0) and the next byte to 4. The example in Figure 11-5 shows a 
savings of only 4 bytes, but it is clear how compression of larger common labels 
can result in more substantial savings. 
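
To make the pointer mechanics concrete, here is a hedged Python sketch of a name decoder that follows compression labels. The function name `decode_name` and the hop limit of 16 are assumptions made for illustration; the byte string in the usage note is built by hand to mirror the usc.edu and ucla.edu example and is not taken from a real capture.

```python
from typing import Tuple


def decode_name(message: bytes, offset: int) -> Tuple[str, int]:
    """Decode a possibly compressed name starting at `offset`.

    Returns the dotted name and the offset of the first byte after
    the name as it appears at the original position.
    """
    labels = []
    next_offset = None  # where parsing resumes after the name
    hops = 0            # guard against pointer loops (assumed limit)
    while True:
        length = message[offset]
        if length & 0xC0 == 0xC0:
            # Compression label: low 6 bits plus next byte = 14-bit offset.
            pointer = ((length & 0x3F) << 8) | message[offset + 1]
            if next_offset is None:
                next_offset = offset + 2
            hops += 1
            if hops > 16:
                raise ValueError("too many compression pointers")
            offset = pointer
            continue
        if length & 0xC0:
            raise ValueError("unsupported label type")
        offset += 1
        if length == 0:  # root label ends the name
            break
        labels.append(message[offset:offset + length].decode("ascii"))
        offset += length
    if next_offset is None:
        next_offset = offset
    return ".".join(labels) + ".", next_offset
```

With `msg = b"\x03usc\x03edu\x00" + b"\x04ucla\xc0\x04"`, calling `decode_name(msg, 0)` yields the name `usc.edu.`, and `decode_name(msg, 9)` yields `ucla.edu.` by following the pointer (bytes 192 and 4) back to the edu label at offset 4.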

11.5.2 The DNS Extension Format (EDNS0) 

The basic DNS message format described so far can be restrictive in a number of 
ways. It has fixed-length fields, a total length limitation of 512 bytes when used 
with UDP (not including UDP or IP headers), and limited space (the 4-bit RCODE 
field) for indicating error types. An extension mechanism called EDNS0 (because 
there could be future extensions beyond the index 0) is specified in [RFC2671]. 
While its use is not ubiquitous at present, it is necessary for supporting DNS 
security (DNSSEC; see Chapter 18), so it is likely to receive more widespread 
deployment over time. 





EDNS0 specifies a particular type of RR (called an OPT pseudo-RR or meta-RR) 
that is added to the additional data section of a request or response to indicate 
the use of EDNS0. At most one such record may be present in any DNS message. 
We will discuss the particular format of an OPT pseudo-RR when we discuss the 
other RR types in Section 11.5.6. For now, the important thing to note is that if a 
UDP DNS message includes an OPT RR, it is permitted to exceed the 512-byte 
length limitation and may contain an expanded set of error codes. 

EDNS0 also defines an extended label type (extending beyond the data labels 
and compression labels mentioned earlier). Extended labels have their first 2 bits 
in the label Type/Length byte set to 01, corresponding to values between 64 and 
127 (inclusive). An experimental binary labeling scheme (type 65) was used at one 
time but is now not recommended. The value 127 is reserved for future use, and 
values above 127 are unallocated. 
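
Because the label kind is carried in the 2 high-order bits of the first byte, a parser can classify a label before reading anything else. The sketch below illustrates that dispatch; the function name `classify_label_byte` and the returned strings are hypothetical, and it treats the bit pattern 10 as unallocated.

```python
def classify_label_byte(first_byte: int) -> str:
    """Classify a label by the 2 high-order bits of its first byte."""
    if not 0 <= first_byte <= 255:
        raise ValueError("expected a single byte value")
    top_bits = first_byte >> 6
    if top_bits == 0b00:
        return "data"         # values 0..63: ordinary length byte
    if top_bits == 0b01:
        return "extended"     # values 64..127: EDNS0 extended label
    if top_bits == 0b11:
        return "compression"  # values 192..255: 14-bit pointer follows
    return "unallocated"      # values 128..191
```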


11.5.3 UDP or TCP 

The well-known port number for DNS is 53, for both UDP and TCP. The most 
common format uses the UDP/IPv4 datagram structure shown in Figure 11-6. 

Figure 11-6 DNS messages are typically encapsulated in a UDP/IPv4 datagram and are limited to 
512 bytes in size unless TCP and/or EDNS0 is used. Each section (except the question 
section) contains a set of resource records. 


When a resolver issues a query and the response comes back with the TC bit 
field set ("truncated"), the size of the true response exceeded 512 bytes, so only 
the first 512 bytes are returned by the server. The resolver may issue the request 
again, using TCP, which now must be a supported configuration [RFC5966]. This 
allows more than 512 bytes to be returned because TCP breaks up large messages 
into multiple segments. 

When a secondary name server for a zone starts up, it normally performs 
a zone transfer from the primary name server for the zone. Zone transfers can 
also be initiated by a timer or as a result of a DNS NOTIFY message (see 
Section 11.5.8.3). Full zone transfers use TCP as they can be large. Incremental zone 
transfers, where only the updated entries are transferred, may use UDP at first but 
switch to TCP if the response is too large, just like a conventional query. 

When UDP is used, both the resolver and the server application software must 
perform their own timeout and retransmission. A recommendation for how to do 
this is given in [RFC1536]. It suggests starting with a timeout of at least 4s, and that 
subsequent timeouts result in an exponential increase of the timeout (a bit like 
TCP's algorithms; see Chapter 14). Linux and UNIX-like systems allow a change to 
be made to the retransmission timeout parameters by altering the contents of the 
/etc/resolv.conf file (by setting the timeout and attempts options). 
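
The timeout-and-retransmit behavior can be sketched independently of any socket code. In the following Python sketch the names `timeout_schedule` and `query_with_retry` are hypothetical, the doubling factor is one possible reading of "exponential increase," and the `send_once` callback stands in for whatever actually sends the UDP query and waits for a reply (it is assumed to raise `TimeoutError` when no reply arrives in time).

```python
from typing import Callable, List


def timeout_schedule(initial: float = 4.0, attempts: int = 3) -> List[float]:
    """Return per-attempt timeouts that double after each failure."""
    return [initial * (2 ** i) for i in range(attempts)]


def query_with_retry(send_once: Callable[[float], bytes],
                     initial: float = 4.0,
                     attempts: int = 3) -> bytes:
    """Call send_once(timeout) until it returns a reply or attempts run out."""
    last_error = None
    for timeout in timeout_schedule(initial, attempts):
        try:
            return send_once(timeout)  # reply received within the timeout
        except TimeoutError as exc:
            last_error = exc           # retransmit with a longer timeout
    raise TimeoutError(f"no reply after {attempts} attempts") from last_error
```

With the defaults shown, the attempts would wait 4s, 8s, and 16s in turn. On systems that read /etc/resolv.conf, the corresponding resolver settings are typically written as a line such as `options timeout:4 attempts:3`.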

11.5.4 Question (Query) and Zone Section Format 

The question or query section of a DNS message lists the question(s) being 
referenced. The format of each question in the question section is shown in Figure 11-7. 
There is normally just one, although the protocol can support more. The same 
structure is also used for the zone section in dynamic updates (see Section 11.5.7), but 
with different names. 

[Figure 11-7 fields, in order: Query Name (Zone Name for DNS UPDATE); 
Query Type, 16 bits (Zone Type for DNS UPDATE); 
Query Class, 16 bits (Zone Class for DNS UPDATE)] 

Figure 11-7 The query (or question) section of a DNS message does not contain a TTL because it is 
not cached. 


The Query Name is the domain name being looked up, using the encoding for 
labels we described before. Each question has a Query Type and Query Class. The 
class value is 1, 254, or 255, indicating the Internet class, no class, or all classes, 
respectively, for all cases in which we are interested (other values are not typically 
used for TCP/IP networks). The Query Type field holds a value indicating the type 
of query being performed using the values from Table 11-3. The most common 
query type is A (or AAAA if IPv6 DNS resolution is enabled), which means that 
an IP address is desired for the query name. It is also possible to create a query 
of type ANY, which returns all RRs of any type in the same class that match the 
query name. 
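
Putting the fixed header and a single question together gives a complete query message. The sketch below is illustrative only: the function name `build_query` and its default arguments are assumptions, and it sets only the RD flag, as a typical stub resolver query would.

```python
import struct


def build_query(name: str, qtype: int = 1, qclass: int = 1,
                txid: int = 0x0002, recursion_desired: bool = True) -> bytes:
    """Build a DNS query with one question and no other sections."""
    flags = 0x0100 if recursion_desired else 0x0000  # only RD may be set
    # Fixed 12-byte header: ID, flags, then the four 16-bit section counts.
    header = struct.pack("!HHHHHH", txid, flags, 1, 0, 0, 0)
    # Query Name as data labels terminated by the 0-length root label.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # Query Type (1 = A) and Query Class (1 = Internet), 16 bits each.
    return header + qname + struct.pack("!HH", qtype, qclass)
```

A type A, class IN query for berkeley.edu built this way is 30 bytes long: 12 bytes of header, 14 bytes of name, and 4 bytes of type and class.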

11.5.5 Answer, Authority, and Additional Information Section Formats 

The final three sections in the DNS message, the answer, authority, and additional 
information sections, contain sets of RRs. RRs in these sections can, for the most 
part, have wildcard domain names as owning names. These are domain names 
in which the asterisk label (a data label containing only the asterisk character 
[RFC4592]) appears first (i.e., leftmost). Each resource record has the form shown 
in Figure 11-8. 

Figure 11-8 The format of a DNS resource record. For DNS in the Internet, the Class field always 
contains the value 1. The TTL field gives the maximum amount of time the RR can be 
cached (in seconds). 


The Name field (sometimes called the "owning name," "owner," or "record 
owner's name") is the domain name to which the following resource data 
corresponds. It is in the same format we described earlier for names and labels. The 
Type field specifies one of the RR type codes (see Section 11.5.6). These are the 
same as the query type values we described earlier. The Class field is 1 for Internet 
data. The TTL field is the number of seconds for which the RR can be cached. The 
Resource Data Length (RDLENGTH) field specifies the number of bytes contained 
in the Resource Data (RDATA) field. The format of this data depends on the type. 
For example, A records (type 1) have a 32-bit IPv4 address in the RDATA area. We 
discuss other RR types later. 
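
The fixed part of a resource record that follows the Name can be read with a single unpack call. This is a minimal Python sketch under stated assumptions: the function name `parse_rr_after_name` is hypothetical, the caller is assumed to have already consumed the Name field, and only type A RDATA is interpreted.

```python
import ipaddress
import struct
from typing import Tuple


def parse_rr_after_name(message: bytes, offset: int) -> Tuple[dict, int]:
    """Parse Type, Class, TTL, RDLENGTH, and RDATA starting at `offset`.

    `offset` must point just past the RR's Name field. Returns the parsed
    fields and the offset of the byte following this RR.
    """
    # 16-bit Type, 16-bit Class, 32-bit TTL, 16-bit RDLENGTH (10 bytes).
    rtype, rclass, ttl, rdlength = struct.unpack_from("!HHIH", message, offset)
    start = offset + 10
    rdata = message[start:start + rdlength]
    if len(rdata) != rdlength:
        raise ValueError("RDATA is shorter than RDLENGTH")
    value = rdata
    if rtype == 1 and rdlength == 4:
        # A record: RDATA is a 32-bit IPv4 address.
        value = str(ipaddress.IPv4Address(rdata))
    record = {"type": rtype, "class": rclass, "ttl": ttl, "value": value}
    return record, start + rdlength
```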

[RFC2181] defines the term Resource Record Set (RRSet) to be a set of resource 
records that share the same name, class, and type but not the same data. This 
occurs, for example, when a host has more than one address record for its name 
(e.g., because it has more than one IP address). TTLs for RRs in the same RRSet 
must be equal. 

11.5.6 Resource Record Types 

Although DNS is most commonly used to determine the IP address(es) that 
correspond to a particular name, it can also be used for the opposite purpose and for 
a number of other things. It can be used with both IPv4 and IPv6 and can even 
provide a distributed database function for other than Internet data (other classes, 
in DNS terminology [RFC6195]). The wide range of capabilities provided by DNS 
is largely attributable to its ability to have different types of resource records. 

There are many types of resource records (see [DNSPARAMS] for the 
complete list), and a single name may have multiple matching RRs. Table 11-3 provides 
a listing of the most common RR types used with conventional DNS (i.e., DNS 
without the DNSSEC security extensions). 


Table 11-3 The popular resource record and query types used in DNS protocol messages. Additional records 
(not shown) are used when DNS security (DNSSEC) is employed. 

Value  RR Type  Reference            Description and Purpose 
1      A        [RFC1035]            Address record for IPv4 (32-bit IPv4 address) 
2      NS       [RFC1035]            Name server; provides name of authoritative name server 
                                     for zone 
5      CNAME    [RFC1035]            Canonical name; maps one name to another (to provide a 
                                     form of name aliasing) 
6      SOA      [RFC1035]            Start of authority; provides authoritative information for the 
                                     zone (name servers, e-mail address of contact, serial number, 
                                     zone transfer timers) 
12     PTR      [RFC1035]            Pointer; provides address to (canonical) name mapping; 
                                     used with in-addr.arpa and ip6.arpa domains for IPv4 
                                     and IPv6 reverse queries 
15     MX       [RFC1035]            Mail exchanger; provides name of e-mail handling host for 
                                     a domain 
16     TXT      [RFC1035] [RFC1464]  Text; provides a variety of information (e.g., used with SPF 
                                     anti-spam scheme to identify authorized e-mail servers) 
28     AAAA     [RFC3596]            Address record for IPv6 (128-bit IPv6 address) 
33     SRV      [RFC2782]            Server selection; transport endpoints of a generic service 
35     NAPTR    [RFC3403]            Name authority pointer; supports alternative name spaces 
41     OPT      [RFC2671]            Pseudo-RR; supports larger datagrams, labels, return codes 
                                     in EDNS0 
251    IXFR     [RFC1995]            Incremental zone transfer 
252    AXFR     [RFC1035] [RFC5936]  Full zone transfer; carried over TCP 
255    (ANY)    [RFC1035]            Request for all (any) records 


Resource records are used for many purposes but can be divided into three 
broad categories: data types, query types, and meta types. Data types are used 
to convey information stored in the DNS such as IP addresses and the names of 
authoritative name servers. Query types use the same values as data types, with 
a few additional values (e.g., AXFR, IXFR, and *). They can be used in the 
question section we described previously. Meta types designate transient data 
associated with a particular single DNS message. The OPT RR is the only meta type we 
discuss in this chapter (all others are covered in Chapter 18). The most common 
data-type RRs include A, NS, SOA, MX, CNAME, PTR, TXT, AAAA, SRV, and 
NAPTR. The NS records are used to relate the DNS name space to the servers that 
perform resolution, and they contain the names of authoritative name servers for 
a zone. The A and AAAA records are used to provide an IPv4 or IPv6 address, 
respectively, given a particular name. The CNAME record provides a way to have 
an alias for another domain name. SRV and NAPTR records help applications to 
discover the location of servers supporting particular services, and to use 
alternative naming schemes (beyond DNS) to access such services. We shall explore each 
of these record types in the following sections. 

11.5.6.1 Address (A, AAAA) and Name Server (NS) Records 
Arguably, the most important records within DNS are the address (A, AAAA) and 
name server (NS) records. The A records contain 32-bit IPv4 addresses, and AAAA 
(called "quad-A") records contain IPv6 addresses. An NS record contains the name 
of an authoritative DNS server that contains information for a particular zone. 
Because the name of a DNS server alone is not sufficient to perform a query, the 
IP address(es) of these servers is also typically provided as a so-called glue record 
in the additional information section of DNS responses. Indeed, such glue records 
are required to avoid loops whenever the names of the authoritative name servers 
use the same domain name for which they are authoritative. (Consider how 
ns1.example.com would be resolved if the name server for example.com was 
ns1.example.com.) We can see the structure of A, AAAA, and NS records using the 
dig tool provided on most Linux/UNIX-like systems. Here, we make a request for 
records of any type associated with the domain name rfc-editor.org: 


Linux% dig +nostats -t ANY rfc-editor.org 

; <<>> DiG 9.6.0-P1 <<>> +nostats -t ANY rfc-editor.org 
;; global options: +cmd 
;; Got answer: 
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53052 
;; flags: qr rd ra; QUERY: 1, ANSWER: 12, AUTHORITY: 0, ADDITIONAL: 2 

;; QUESTION SECTION: 
;rfc-editor.org.            IN    ANY 

;; ANSWER SECTION: 
rfc-editor.org.    1654    IN    AAAA    2001:1890:1112:1::2f 
rfc-editor.org.    1654    IN    A       64.170.98.47 
rfc-editor.org.    1654    IN    NS      ns0.ietf.org. 
rfc-editor.org.    1654    IN    NS      ns1.hkg1.afilias-nst.info. 

;; ADDITIONAL SECTION: 
ns0.ietf.org.      756     IN    A       64.170.98.2 
ns0.ietf.org.      756     IN    AAAA    2001:1890:1112:1::14 

In the command's output, the first two lines indicate the version of the dig 
program being used and the options provided to it, plus implied options (+cmd 
means that this information itself should be printed). The next portion indicates 
data in the DNS reply message: the QUERY opcode, NOERROR status indicating no 
errors were encountered, and a transaction ID of 53052. In the OpCode field, QUERY 
is used for both queries and responses. Next, the flags line indicates that the 
message is a query response (qr flag) and not a query and that recursion was desired in 
the original query (rd flag) and is provided by the responding server (ra flag). The 
message contains a section with one query, and 12 resource records in the answer 
section (only 4 are shown). There are no RRs in the authority section, meaning that 
this response is likely from a caching server (the RRs are not authoritative). 
Different results might be obtained by interacting with different servers. The additional 
information section contains IPv4 and IPv6 addresses for one of the authoritative 
servers, should we wish to contact it. The question section contains a copy of our 
original query: type ANY for domain name rfc-editor.org. 

Among the four RRs in the answer section shown, we find one A type, one 
AAAA type, and two NS types. From this information we can see that the domain 
name rfc-editor.org is a host with IPv4 address 64.170.98.47 and IPv6 
address 2001:1890:1112:1::2f. It is also a subdomain, as indicated by the 
presence of the NS records. We can quickly guess and verify that there is at least one 
host in this subdomain using the following command: 


Linux% host ftp.rfc-editor.org 

ftp.rfc-editor.org has address 64.170.98.47 

This example indicates a few interesting aspects of A, AAAA, and NS records. 
First, it is possible for a single domain name to have records of each of these types 
(and more). This is fairly common for IPv6-capable servers that are the 
"well-known" servers for a particular organization. We can also see that each record has 
a TTL value, and they differ considerably, except for those in the same RRSet. The 
TTL for the records in the answer section is 1654s (about half an hour), and the 
TTL for records in the additional information section is 756s (about 12 minutes). 
Note that the TTL value of a cached record is never more than the TTL of the same 
record retrieved from the authoritative source. TTLs for cached records "decay" 
until the record is retrieved again from an authoritative server. As a result, 
retrieving a cached record multiple times from the same server usually shows a 
decreasing TTL value. 
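
The "decay" of a cached TTL can be modeled by storing an expiry time and reporting the time remaining. The class below is a hypothetical sketch of how a caching server might do this; it is not how any particular implementation works, and the `now` parameter exists only so the behavior can be exercised without waiting.

```python
import time
from typing import Optional


class CachedRecord:
    """A cached RR whose reported TTL counts down toward zero."""

    def __init__(self, value: str, ttl: int, now: Optional[float] = None):
        start = time.monotonic() if now is None else now
        self.value = value
        self.expires_at = start + ttl  # fixed at the time of caching

    def remaining_ttl(self, now: Optional[float] = None) -> int:
        """TTL a caching server would report if asked at time `now`."""
        current = time.monotonic() if now is None else now
        return max(0, int(self.expires_at - current))

    def is_expired(self, now: Optional[float] = None) -> bool:
        return self.remaining_ttl(now) == 0
```

A record cached with a TTL of 1654s would be reported with a TTL of 1554s if requested again 100s later, and must be fetched anew once the remaining TTL reaches 0.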

11.5.6.2 Example 

Now that we have seen the DNS message format, transport protocol options, and 
RR types for basic queries and responses, let us see an example. We start with a 
simple case to see the communication between a resolver on a client, a local name 
server, and a remote name server managed by an ISP. This scenario demonstrates 
the importance of caching in DNS. The topology is shown in Figure 11-9. 

Figure 11-9 A simple DNS query/response example. The local DNS server (GW.HOME) provides recursion to 
the client (A.HOME), and uses the DNS server provided at the ISP when requested data is not 
present in the cache. 


On our Windows client (A.HOME) we begin with a command that removes 
any DNS data cached by the resolver libraries. We then perform a query for the 
address (A record type) of the domain name berkeley.edu: 


C:\> ipconfig /flushdns 

Windows IP Configuration 

Successfully flushed the DNS Resolver Cache. 

C:\> nslookup 
Default Server: gw 
Address: 10.0.0.1 

> set type=a 
> berkeley.edu. 
Server: gw 
Address: 10.0.0.1 

Non-authoritative answer: 
Name: berkeley.edu 
Address: 169.229.131.81 

The first command is specific to Windows and removes data cached by the 
client's resolver software. The nslookup program, available on both Windows and 
Linux/UNIX-based systems, provides a basic way to query the DNS for specific 
data. Upon execution, it indicates which name server it is using for resolution (here 
the server is gw at the address 10.0.0.1). Using the set command, we arrange to 
query for A records, and then query for the name berkeley.edu.. Once again, 
nslookup indicates which server it uses for the resolution. It then also gives us 
an indication that the answer is nonauthoritative (i.e., it is being provided by a 
caching server) and the requested address is 169.229.131.81. 

To see what happens with the DNS protocol at the packet level, we use 
Wireshark and have a look at the first packet in detail, as shown in Figure 11-10. 

Figure 11-10 A UDP/IPv4 datagram containing a DNS standard query for the IPv4 address 
associated with berkeley.edu.. 


There are two messages in the trace: a standard query and a standard query 
response. In the first message (the query), the source IPv4 address is 10.0.0.120 (a 
DHCP-assigned address at the client; see Chapter 6), and the destination is 10.0.0.1 
(the DNS server). The query is a UDP/IPv4 datagram with source port 56288 (an 
ephemeral port) and destination port 53 (the well-known DNS port). In terms of its 
full encapsulation, the request is an Ethernet frame containing 72 bytes. This size 
can be derived by summing the following parts: Ethernet header (14 bytes), IPv4 
header (20 bytes), UDP header (8 bytes), DNS fixed header (12 bytes), query type (2 
bytes), query class (2 bytes), plus the data labels for berkeley (9 bytes) and edu 
(4 bytes), plus the trailing 0 byte. 
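
The size calculation above is easy to reproduce. The following Python sketch (the function name `dns_query_frame_size` is hypothetical) assumes a single-question query, an IPv4 header without options, and no EDNS0 OPT record; the link-layer header size is a parameter because it depends on how the frame is captured.

```python
def dns_query_frame_size(name: str, link_header: int = 14) -> int:
    """Total frame size of a one-question DNS query over UDP/IPv4."""
    ipv4_header = 20   # no IP options assumed
    udp_header = 8
    dns_header = 12    # fixed DNS header
    labels = name.rstrip(".").split(".")
    # Each data label costs 1 length byte plus its characters,
    # and the name ends with a single 0 byte for the root label.
    qname = sum(1 + len(label) for label in labels) + 1
    qtype_and_qclass = 2 + 2
    return (link_header + ipv4_header + udp_header
            + dns_header + qname + qtype_and_qclass)
```

For berkeley.edu with a 14-byte Ethernet header this gives 72 bytes. With a 16-byte link-layer header, as used by the Linux "cooked" capture format, the same query occupies 74 bytes.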

Turning to the details of the DNS header, the transaction ID is 0x0002 and 
forms the first 2 bytes of the DNS header, located at the start of the UDP payload. 
Only a single flag (recursion requested, the default) is set, so this message is a 
query. The message contains a standard query with one question. The other 
sections are empty. The question itself is for the name berkeley.edu and is seeking 
information of type A (address records) in the IN (Internet) class. After 
receiving this message, the name server process running on 10.0.0.1, unable to directly 
respond because it does not know the address, forwards the query to the next 
(upstream) name server it is configured to use. In this particular case, that name 
server is at the address 206.13.28.12 (see Figure 11-11). 

In Figure 11-11 we see a query similar to the one sent by the client, but in this 
case the source IPv4 address is 70.231.136.162 (the ISP-side IPv4 address of GW.HOME). 
The destination address is 206.13.28.12, the IPv4 address of the ISP-provided DNS 
server, and the source port is an ephemeral port on the local DNS server (60961). 

Frame 1: 74 bytes on wire (592 bits), 74 bytes captured (592 bits) 
Linux cooked capture 
Internet Protocol, Src: 70.231.136.162 (70.231.136.162), Dst: 206.13.28.12 (206.13.28.12) 
User Datagram Protocol, Src Port: 60961 (60961), Dst Port: 53 (53) 
Domain Name System (query) 
    [Response In: 2] 
    Transaction ID: 0xb0b8 
    Flags: 0x0100 (Standard query) 
    Questions: 1 
    Answer RRs: 0 
    Authority RRs: 0 
    Additional RRs: 0 
    Queries 
        berkeley.edu: type A, class IN 
            Name: berkeley.edu 
            Type: A (Host address) 
            Class: IN (0x0001) 

Figure 11-11 A DNS request generated at GW.HOME being sent to the ISP name server as a 
consequence of recursion. 


The transaction ID is generated anew and set to 0xb0b8. Note that Wireshark 
indicates that the response to the query is contained in packet number 2. 

Packet 2 in Figure 11-12 is the first DNS response we have seen. First, we note 
that the UDP source port number is 53, but the destination port is the ephemeral 
port number 60961. The transaction ID matches the query (0xb0b8), but the Flags 
field now contains the value 0x8180 (response, recursion requested, and recursion 
available are all set). The question section contains a copy of the question for which 
answers are being provided and typically matches the original query sent by the 
client exactly (e.g., case is preserved). There is one RR in the answer section. It is of 
type A (address), has a TTL of 10 minutes and a data length of 4 bytes (the size of 
an IPv4 address), and the value is 169.229.131.81, the IPv4 address we requested for 
berkeley.edu. Note that the authority flag is not set, and the authority section of 
the reply is empty. This response is based upon cached data; it is not authoritative 
for the domain. At this point, the local name server also caches the value (but only 
for up to 10 minutes as specified by the TTL in the RR it received) and responds to 
the requesting client (see Figure 11-13). 

The response in Figure 11-13, packet 2, is much like the one from 206.13.28.12, 
except it is now sent from 10.0.0.1 to our original client at 10.0.0.120, and the 
transaction ID matches the one in the original DNS request. Note also that from the 
client's point of view the entire round-trip time of the transaction was about 14.7ms, 
but we know that most of that time (14.2ms) was taken up in the transaction 
between the local name server (GW.HOME) and the ISP's name server (206.13.28.12). 

Figure 11-12 A standard DNS response sent from the ISP's DNS server back to GW.HOME. 

Frame 2: 88 bytes on wire (704 bits), 88 bytes captured (704 bits) 
Ethernet II, Src: 00:04:5a:9f:9e:80 (00:04:5a:9f:9e:80), Dst: 00:17:f2:e7:6d:91 (00:17:f2:e7:6d:91) 
Internet Protocol, Src: 10.0.0.1 (10.0.0.1), Dst: 10.0.0.120 (10.0.0.120) 
User Datagram Protocol, Src Port: 53 (53), Dst Port: 56288 (56288) 
Domain Name System (response) 
    [Request In: 1] 
    [Time: 0.014680000 seconds] 
    Transaction ID: 0x0002 
    Flags: 0x8180 (Standard query response, No error) 
    Questions: 1 
    Answer RRs: 1 
    Authority RRs: 0 
    Additional RRs: 0 
    Queries 
        berkeley.edu: type A, class IN 
            Name: berkeley.edu 
            Type: A (Host address) 
            Class: IN (0x0001) 
    Answers 
        berkeley.edu: type A, class IN, addr 169.229.131.81 
            Name: berkeley.edu 
            Type: A (Host address) 
            Class: IN (0x0001) 
            Time to live: 10 minutes 
            Data length: 4 
            Addr: 169.229.131.81 (169.229.131.81) 

Figure 11-13 A response generated by GW.HOME and destined for the client. This message completes 
the recursive DNS transaction. 


11.5.6.3 Canonical Name (CNAME) Records 

The CNAME record stands for canonical name record and is used to introduce an 
alias for a single domain name into the DNS naming system. For example, the 
name www.berkeley.edu may have a CNAME record that maps to some other 
machine (e.g., www.w3.berkeley.edu), so that if the Web server is located at a 
different computer, a relatively simple change to the DNS database may be all that 
is required for the rest of the world to find the new system. It is now common 
practice to use CNAME records to establish aliases for common services. As a result, 
names such as www.berkeley.edu, ftp.sun.com, mail.berkeley.edu, and 
www.ucsd.edu are all CNAME entries in the DNS that refer to other RRs. 

Within a CNAME RR, the RDATA section contains the "canonical name" 
associated with the domain name (alias). Such names use the same type of encoding as 
other names (e.g., data labels and compression labels). When a CNAME RR is 
present for a particular name, no other data is permitted [RFC1912] (unless DNSSEC 
is in use; see Chapter 18). Domain names of CNAME RRs may not be used in all 
places that regular domain names can (e.g., as the target of an NS RR). Also, the 
canonical name may itself be a CNAME (called CNAME chaining), but this is 
usually discouraged, as it can cause DNS resolvers to make more queries than would 
otherwise be necessary. Nonetheless, there are certain services that make use of 
this feature. For example, the high-volume site www.whitehouse.gov (at the time 
of writing) uses a content delivery network (CDN; see footnote 2) provided by the 
Akamai Corporation. When we look up this domain name, we find the following: 

Linux% host -t any www.whitehouse.gov 
www.whitehouse.gov is an alias for www.whitehouse.gov.edgesuite.net. 
Linux% host -t any www.whitehouse.gov.edgesuite.net 
www.whitehouse.gov.edgesuite.net is an alias for a1128.h.akamai.net. 
Linux% host -t any a1128.h.akamai.net 
a1128.h.akamai.net has address 92.123.65.42 
a1128.h.akamai.net has address 92.123.65.51 


2. A content delivery network typically includes a number of synchronized content caches located in 
particular topological locations in the network. CDNs attempt to minimize latency for consumers 
accessing content in exchange for payment from content providers.

Thus, CNAME chains can be used with DNS. However, because of their
potential performance impact, such chains are often limited by resolvers to a few
"links" (such as five). Long chains are likely the result of an error in execution or
a misunderstanding, as it is hard to imagine why they should be necessary under
normal circumstances.
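
The link limit described above can be sketched in a few lines of Python. This is only an illustration: the record table, the example names, and the limit of five are assumptions made for the sketch, and a real resolver would obtain each record by querying name servers.

```python
# Hypothetical in-memory records standing in for DNS answers.
RECORDS = {
    "www.example.com": ("CNAME", "www.example.com.cdn.example.net"),
    "www.example.com.cdn.example.net": ("CNAME", "edge1.cdn.example.net"),
    "edge1.cdn.example.net": ("A", ["192.0.2.42", "192.0.2.51"]),
}

MAX_CNAME_LINKS = 5  # assumed limit; actual resolvers choose their own


def resolve(name, records=RECORDS, max_links=MAX_CNAME_LINKS):
    """Follow CNAME links from name and return (canonical_name, addresses)."""
    links = 0
    while True:
        rtype, value = records[name]
        if rtype != "CNAME":
            return name, value
        links += 1
        if links > max_links:
            # Treat an overly long (or looping) chain as an error.
            raise RuntimeError("CNAME chain too long")
        name = value
```

Counting links rather than remembering visited names keeps the sketch short and also catches loops, at the cost of rejecting a legitimate chain longer than the limit.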


Note 

There is a standard resource record called DNAME (type 39) [RFC2672][IDDN].
DNAME records act like CNAME records but for an entire zone. For example,
all names of the form NAME.example.com could be mapped to NAME.newexample.com
using a single DNAME resource record. However, DNAME records
do not apply to the top-level record itself (example.com here).
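
The substitution a DNAME record implies can be pictured with a small Python sketch. The function name and its return convention are assumptions made for illustration; it performs only the string rewrite and none of the DNS processing around it.

```python
def apply_dname(qname, owner, target):
    """Rewrite qname from beneath owner to beneath target.

    Returns None when the DNAME does not apply, which includes the
    owner name itself (only names below the owner are mapped).
    """
    q = qname.rstrip(".").lower()
    o = owner.rstrip(".").lower()
    if q == o or not q.endswith("." + o):
        return None
    # Keep the leading labels (and their trailing dot), swap the suffix.
    return q[: -len(o)] + target.rstrip(".").lower()
```

For instance, apply_dname("www.example.com", "example.com", "newexample.com") gives "www.newexample.com", while the owner name example.com itself is left unmapped.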


11.5.6.4 Reverse DNS Queries: PTR (Pointer) Records 

Although the most critical function of DNS is to provide mappings from names 
to IP addresses, there are many circumstances where the reverse mapping is 
required. For example, a server receiving an incoming TCP/IP connection request 
is able to ascertain the source IP address of the connection from the incoming IP 
datagram, but the name(s) corresponding to the address are not carried in the
connection itself; such name(s) must be looked up in some other way. Fortunately, a
clever use of the DNS can provide this capability. 

The PTR RR type is used in response to reverse DNS queries, which are typically
necessary when converting an IP address to a name. This uses the special
in-addr.arpa domain (ip6.arpa for IPv6). Consider an IPv4
address such as 128.32.112.208. In the classful address structure (see Chapter 2), 
this address is taken from the 128.32 class B address space. To determine the name 
corresponding to this address, the address is first reversed, and then the special 
domain is added. In this example, a query for a PTR record using the name 


208.112.32.128.in-addr.arpa. 


would be used. In effect, this is a query for the "host" 208 in the "domain"
112.32.128.in-addr.arpa.. We shall see more examples of reverse DNS
queries later in this section.
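
As a minimal sketch of the reversal just described, the following Python function builds the PTR query name for an IPv4 address. The function name is our own; the standard library's ipaddress module is used only to validate the input.

```python
import ipaddress


def ptr_name_v4(addr):
    """Return the in-addr.arpa query name for a dotted-decimal IPv4 address."""
    # Parsing first rejects malformed input such as "300.1.2.3".
    octets = str(ipaddress.IPv4Address(addr)).split(".")
    return ".".join(reversed(octets)) + ".in-addr.arpa."
```

The ipaddress module also offers a reverse_pointer attribute that yields the same name without the trailing period.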


Note 

The regular DNS name space, which usually uses NS, A, and AAAA records,
is not automatically linked with the "reverse" name space supported by PTR
records. Thus it is possible (and even relatively common) to have an existing 
forward resolution that does not have a corresponding reverse mapping set up 
(or has a different one). Some services check to see that both directions are set 
up with equivalent mappings and may deny service under such circumstances.

Recall that IPv4 addresses are typically written in the "dotted-decimal" format
and IPv6 addresses are written in the hex format (e.g., 169.229.131.81 and
2001:503:a83e::2:30, respectively). These addresses can be thought of as names
existing in a left-to-right hierarchy. For example, the address 169.229.131.81 has
the top-down hierarchy (reading left to right) 169, 229, 131, 81. By reversing the
dotted-decimal IPv4 address and treating it as a DNS name, we can employ DNS
to perform the mapping from IP address to name(s). So, the name 81.131.229.169
would effectively be the reversal of the IPv4 address 169.229.131.81. For IPv6, the
scheme is similar, but any suppressed zeros are expanded, and each hexadecimal
digit becomes a character. For example, the reversal of 2001:503:a83e::2:30 would
be 0.3.0.0.2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.e.3.8.a.3.0.5.0.1.0.0.2. Fortunately, users rarely
have to type in these names directly.
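
The IPv6 expansion and reversal can be checked mechanically. The following Python sketch (the function name is our own) expands the suppressed zeros, drops the colons, and reverses the 32 hexadecimal digits:

```python
import ipaddress


def ptr_name_v6(addr):
    """Return the ip6.arpa query name for an IPv6 address."""
    # .exploded restores suppressed zeros: eight groups of four hex digits.
    digits = ipaddress.IPv6Address(addr).exploded.replace(":", "")
    return ".".join(reversed(digits)) + ".ip6.arpa."
```

Applied to 2001:503:a83e::2:30, it produces the 32-label string given above followed by ip6.arpa.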

As mentioned previously, the special domains .in-addr.arpa (for IPv4) and
.ip6.arpa (for IPv6) are used in conjunction with the PTR ("pointer") RR type in
support of these types of names and reverse DNS lookups. For example, consider
the following commands:

C:\> nslookup
Default Server:  gw
Address:  10.0.0.1

> server c.in-addr-servers.arpa
Default Server:  c.in-addr-servers.arpa
Address:  196.216.169.10

> set type=ptr
> 81.131.229.169.in-addr.arpa.
Server:  c.in-addr-servers.arpa
Address:  196.216.169.10

169.in-addr.arpa        nameserver = w.arin.net
169.in-addr.arpa        nameserver = t.arin.net
169.in-addr.arpa        nameserver = dill.arin.net
169.in-addr.arpa        nameserver = x.arin.net
169.in-addr.arpa        nameserver = z.arin.net
169.in-addr.arpa        nameserver = y.arin.net
169.in-addr.arpa        nameserver = u.arin.net
169.in-addr.arpa        nameserver = v.arin.net


This example shows how the .in-addr.arpa domain is set up. According
to [RFC5855], the in-addr-servers.arpa and ip6-servers.arpa domains
are used in forming the domain names associated with the servers that provide
reverse DNS mappings for IPv4 and IPv6, respectively. As of 2011, there are six
such servers for each version of IP: X.in-addr-servers.arpa and
X.ip6-servers.arpa, where X is any letter a through f (inclusive).

Although the twelve servers we have mentioned contain authoritative data for
reverse mappings, they do not contain the information we are looking for. In our
example, the first server contacted instead told us to contact one of the eight name
servers maintained by ARIN, the American Registry for Internet Numbers, which
is authoritative for IPv4 addresses that start with 169. If we in turn contact one
of these servers, we find that a PTR query for 81.131.229.169.in-addr.arpa.
gives the following response:

> server w.arin.net
Default Server:  w.arin.net
Address:  72.52.71.2
Default Server:  w.arin.net
Address:  2001:470:1a::2

> 81.131.229.169.in-addr.arpa.
Server:  w.arin.net
Address:  72.52.71.2

229.169.in-addr.arpa    nameserver = adns1.berkeley.edu.
229.169.in-addr.arpa    nameserver = phloem.uoregon.edu.
229.169.in-addr.arpa    nameserver = aodns1.berkeley.edu.
229.169.in-addr.arpa    nameserver = adns2.berkeley.edu.


Here we can surmise that the network prefix 169.229/16 is owned by an educational
institution called Berkeley, that the campus maintains three name servers
covering its in-addr.arpa space, and that the University of Oregon also provides
a copy. Continuing by contacting one of these servers, we find our answer
(this time using the Linux version of nslookup with slightly different output):


Linux% nslookup
> set type=ptr
> server adns1.berkeley.edu
Default Server: adns1.berkeley.edu
Address: 128.32.136.3#53
Default Server: adns1.berkeley.edu
Address: 2607:f140:ffff:fffe::3#53
> 81.131.229.169.in-addr.arpa.
Server:         adns1.berkeley.edu
Address:        128.32.136.3#53

81.131.229.169.in-addr.arpa     name = webfarm.Berkeley.EDU


Here we obtain the result we were looking for, that the IPv4 address
169.229.131.81 has the name webfarm.Berkeley.EDU. The DNS server uses
port 53, as indicated by the #53 following the IP addresses. This output makes
it obvious that accessing the DNS with UDP/IPv4 (as opposed to UDP/IPv6) can
still provide mappings for IPv6 addresses using "quad-A" (AAAA) DNS records
because we can see that the IPv6 address of the server is 2607:f140:ffff:fffe::3.

If there were not a separate branch of the DNS tree for handling the
address-to-name translation, there would be essentially no way to do the reverse
translation other than starting at the root of the tree and trying every top-level domain.
This is clearly an unreasonable option, given the current size of the Internet. The
in-addr.arpa solution is effective and fairly efficient, although the reversed
bytes of the IPv4/IPv6 address and the special domains can be confusing.
Fortunately, as mentioned before, users can typically avoid having to type or refer
to them. Even application writers do not typically have to manipulate addresses
to perform reverse queries, as library functions (such as the C library function
getnameinfo()) perform this task.
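
Python exposes the same library call as socket.getnameinfo(), so a reverse lookup needs no explicit handling of in-addr.arpa names. The wrapper below is a minimal sketch; its name and its choice to return None on failure are our own, and the result depends entirely on the local resolver configuration.

```python
import socket


def reverse_lookup(addr):
    """Return the host name for an IPv4 or IPv6 address string, or None."""
    try:
        # NI_NAMEREQD makes the call fail instead of echoing the address
        # back when no name can be found. The port (0) is not of interest.
        host, _service = socket.getnameinfo((addr, 0), socket.NI_NAMEREQD)
    except OSError:
        return None
    return host
```

The library reverses the address, appends the special domain, and issues the PTR query (or consults local sources such as a hosts file) on the caller's behalf.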

It is worth mentioning here that PTR queries have become a significant concern
for the global DNS servers. Consider a home network using one of the private
address prefixes such as 10.0.0.0/8 (IPv4) or fc00::/7 (IPv6). When a system
receives an incoming connection request from another system on the same
privately addressed subnet, it may wish to resolve the source address to a name and
does so by performing a PTR query. If the query is not answered by the local DNS
server, it will likely propagate to the global Internet. For this reason (and a few
others), [RFC6303] specifies that local name servers (especially those operating
in networks using private IP addressing that are attached to the Internet) provide
PTR mappings for the private address space defined in [RFC1918] for IPv4 and
[RFC4193] for IPv6 (i.e., in IN-ADDR.ARPA and D.F.IP6.ARPA, respectively).

11.5.6.5 Classless in-addr.arpa Delegation

When organizations join the Internet and obtain authority to fill in a portion of
the DNS name space, they often also obtain authority for a portion of the
in-addr.arpa name space corresponding to their IPv4 addresses on the Internet. In
the case of UC Berkeley, authority includes the network prefix 169.229/16, which,
using older terminology, is "class B" network number 169.229. Thus, UC Berkeley
would be expected to populate a portion of the DNS tree with PTR records using
names ending in 229.169.in-addr.arpa. This works fine for cases where the
address prefix assigned to the organization is one of the older class A, B, or C
styles where the number of bits is an integral multiple of 8. However, many
organizations today are given prefix lengths of greater than 24 bits or greater than 16
bits (but less than 24). In these cases, the address range is not easily written as a
simple reversal of the IP address. Instead, some method of conveying the network
prefix length must be included as well.

The standard method for implementing this, given by [RFC2317], is to append
the length of the prefix to the reversed octets and use it as the first label in the
domain name. For example, assume that a site is assigned the prefix 12.17.136.128/25,
a prefix that includes 128 addresses. According to [RFC2317], two types of records
should be provided. First, for each name of the form X.136.17.12.in-addr.arpa
(where X is at least 128 and not more than 255), a CNAME RR is created, likely
maintained by a site's ISP, according to the following pattern:


128.136.17.12.in-addr.arpa.  canonical name =
        128.128/25.136.17.12.in-addr.arpa.
129.136.17.12.in-addr.arpa.  canonical name =
        129.128/25.136.17.12.in-addr.arpa.
...
255.136.17.12.in-addr.arpa.  canonical name =
        255.128/25.136.17.12.in-addr.arpa.

Here we can see how the network prefix is encoded, with the / notation
associated with the second label in the domain name (for this example). These entries
are typically placed by an ISP and allow for delegations on non-byte-aligned
address ranges. In this example, the customer is now able to provide mappings
for the zone 128.128/25.136.17.12.in-addr.arpa. We can trace the delegation
as follows:

C:\> nslookup
Default Server:  gw
Address:  10.0.0.1

> server f.in-addr-servers.arpa
Default Server:  f.in-addr-servers.arpa
Addresses:  193.0.9.1

> set type=ptr
> 129.128/25.136.17.12.in-addr.arpa.
Server:  f.in-addr-servers.arpa
Address:  193.0.9.1

12.in-addr.arpa nameserver = dbru.br.ns.els-gms.att.net
12.in-addr.arpa nameserver = cbru.br.ns.els-gms.att.net
12.in-addr.arpa nameserver = cmtu.mt.ns.els-gms.att.net
12.in-addr.arpa nameserver = dmtu.mt.ns.els-gms.att.net

> server dbru.br.ns.els-gms.att.net.
Default Server:  dbru.br.ns.els-gms.att.net
Address:  199.191.128.106

> 129.128/25.136.17.12.in-addr.arpa.
128/25.136.17.12.in-addr.arpa nameserver = ns2.intel-research.net
128/25.136.17.12.in-addr.arpa nameserver = ns1.intel-research.net

> server ns1.intel-research.net.
Server:  ns1.intel-research.net
Address:  12.155.161.131

> 129.128/25.136.17.12.in-addr.arpa.
129.128/25.136.17.12.in-addr.arpa
        name = dmz.slouter.seattle.intel-research.net
128/25.136.17.12.in-addr.arpa
        nameserver = bldmzsvr.berkeley.intel-research.net
128/25.136.17.12.in-addr.arpa
        nameserver = sldmzsvr.intel-research.net
bldmzsvr.berkeley.intel-research.net    internet address = 12.155.161.131
sldmzsvr.intel-research.net     internet address = 12.17.136.131

In this example, we wish to find out the name for the host associated with IPv4
address 12.17.136.129. We have already seen that it has a CNAME RR pointing
to the canonical name 129.128/25.136.17.12.in-addr.arpa.. We instruct our
resolver to use one of the root servers (F) and arrange for the query type to be for
a PTR RR. At this point we request a resolution for
129.128/25.136.17.12.in-addr.arpa.. The root name server does not have this information, and it does not
perform recursion, so it returns the names of the authoritative servers for the domain
12.in-addr.arpa.. Picking one of them (DBRU), we again try to resolve our question.
This time we find two name servers (ns1 and ns2). Picking one of these, we
are able to resolve the PTR request. It resolves to the name
dmz.slouter.seattle.intel-research.net.
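
The CNAME pattern used for classless delegation in this section is regular enough to generate mechanically. The following Python sketch does so for any IPv4 prefix longer than /24; the function name is our own, and the output is a list of name pairs rather than zone-file syntax.

```python
import ipaddress


def rfc2317_cnames(prefix):
    """Yield (owner, canonical_name) pairs for an IPv4 prefix longer than /24."""
    net = ipaddress.ip_network(prefix)
    if net.version != 4 or net.prefixlen <= 24:
        raise ValueError("expected an IPv4 prefix longer than /24")
    a, b, c, first = str(net.network_address).split(".")
    parent = f"{c}.{b}.{a}.in-addr.arpa."
    # The delegated zone label combines the first address and prefix length.
    zone = f"{first}/{net.prefixlen}.{parent}"
    for addr in net:  # every address in the prefix, including first and last
        last = str(addr).rsplit(".", 1)[1]
        yield f"{last}.{parent}", f"{last}.{zone}"
```

For 12.17.136.128/25 this yields 128 pairs, from 128.136.17.12.in-addr.arpa. through 255.136.17.12.in-addr.arpa., each pointing into 128/25.136.17.12.in-addr.arpa.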

11.5.6.6 Authority (SOA) Records 

In DNS, each zone has an authority record, using an RR type called start of authority
(SOA). These records provide authoritative links between portions of the DNS
name space and the servers that provide the zone information allowing various
queries to be performed for addresses and other information. The SOA RR is used
to identify the name of the host providing the official permanent database, the
responsible party's e-mail address (where . is used instead of @), zone update
parameters, and the default TTL. The default TTL is applied to RRs in the zone
that are not otherwise assigned an explicit per-RR TTL.
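
The e-mail encoding is easy to undo: the first unescaped period in the responsible-party name stands for the @ sign. The Python sketch below (the function name is our own) applies that rule; it also handles a backslash-escaped period in the mailbox part, a detail of the DNS master-file convention rather than something discussed above.

```python
def rname_to_email(rname):
    """Convert an SOA responsible-party name to an e-mail address."""
    rname = rname.rstrip(".")
    i = 0
    while i < len(rname):
        if rname[i] == "\\":
            i += 2  # skip an escaped character such as "\."
            continue
        if rname[i] == ".":
            local = rname[:i].replace("\\.", ".")
            return local + "@" + rname[i + 1:]
        i += 1
    raise ValueError("no domain part in responsible-party name")
```

A value such as hostmaster.berkeley.edu therefore corresponds to the mailbox hostmaster@berkeley.edu.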

The zone update parameters include a serial number, refresh time, retry time, 
and expire time. The serial number is increased (by at least 1), usually by the 
network administrator, anytime there is a change to the zone contents. It is used 
by secondary servers to determine if they should initiate a zone transfer (when 
they do not have a copy of the zone contents with the largest serial number). The
refresh time tells secondary servers how long to wait before checking the SOA 
record from the primary and its version number to determine if a zone transfer is 
required. The retry and expire times are used in the case of zone transfer failure. 
The retry value gives the time (in seconds) a secondary will wait before retrying. 
The expire time is an upper bound (in seconds) that a secondary server will keep 
retrying zone transfers before giving up. If it gives up, such a server ceases to 
respond to queries for the zone. In general, a zone can contain a mix of IPv4 and
IPv6 data and can be accessed using either version of IP. In this example, we use
IPv6 (using nslookup on an IPv6-only Windows host):

C:\> nslookup
Default Server:  gw
Address:  fe80::204:5aff:fe9f:9e80

> set type=soa
> berkeley.edu.
Server:  gw
Address:  fe80::204:5aff:fe9f:9e80

Non-authoritative answer:
berkeley.edu
        primary name server = ns-master1.berkeley.edu
        responsible mail addr = hostmaster.berkeley.edu
        serial  = 2009050116
        refresh = 10800 (3 hours)
        retry   = 1800 (30 mins)
        expire  = 3600000 (41 days 16 hours)
        default TTL = 300 (5 mins)

> server adns1.berkeley.edu.
Default Server:  adns1.berkeley.edu
Addresses:  2607:f140:ffff:fffe::3
          128.32.136.3

> berkeley.edu.
Server:  adns1.berkeley.edu
Addresses:  2607:f140:ffff:fffe::3
          128.32.136.3

berkeley.edu
        primary name server = ns-master1.berkeley.edu
        responsible mail addr = hostmaster.berkeley.edu
        serial  = 2009050116
        refresh = 10800 (3 hours)
        retry   = 1800 (30 mins)
        expire  = 3600000 (41 days 16 hours)
        default TTL = 300 (5 mins)

berkeley.edu            nameserver = ns.v6.berkeley.edu
berkeley.edu            nameserver = aodns1.berkeley.edu
berkeley.edu            nameserver = adns2.berkeley.edu
berkeley.edu            nameserver = phloem.uoregon.edu
berkeley.edu            nameserver = adns1.berkeley.edu
berkeley.edu            nameserver = ucb-ns.NYU.edu
ns.v6.berkeley.edu      internet address = 128.32.136.6
ns.v6.berkeley.edu      AAAA IPv6 address = 2607:f140:ffff:fffe::6
adns1.berkeley.edu      internet address = 128.32.136.3
adns1.berkeley.edu      AAAA IPv6 address = 2607:f140:ffff:fffe::3
adns2.berkeley.edu      internet address = 128.32.136.14
adns2.berkeley.edu      AAAA IPv6 address = 2607:f140:ffff:fffe::e
aodns1.berkeley.edu     internet address = 192.35.225.133
aodns1.berkeley.edu     AAAA IPv6 address =
                        2607:f010:3f8:8000:214:4fff:fe45:e6a2
phloem.uoregon.edu      internet address = 128.223.32.35
phloem.uoregon.edu      AAAA IPv6 address = 2001:468:d01:20::80df:2023


Here we can see that not only did we receive the SOA record, but we also
received a list of six authoritative name servers, and the IPv4/IPv6 addresses (glue
records) for five of them (the address for the NYU server is not given, as glue
records for NYU.edu would be in a different zone supported by a different server).
As this is one of the more interesting responses we have seen, let us look at the
packet contents corresponding to the request sent to the authoritative name server,
adns1.berkeley.edu (see Figure 11-14).

This trace contains two packets, and we have chosen to display the reply,
which is the more interesting of the two. A query for an SOA RR was sent to the
host 2607:f140:ffff:fffe::3 (adns1.Berkeley.EDU) from the local system's globally
scoped IPv6 address 2001:5c0:1101:ed00:5571:5f81:e0a6:4978. The response is carried
in an IPv6 datagram with 491 bytes total length (the Payload Length field is
451). This particular packet contains the IPv6 header (40 bytes), UDP header (8
bytes), plus the DNS message (443 bytes). The DNS message includes one question,
one answer, six authority RRs, and ten additional RRs.

No.  Time      Source                                  Destination             Protocol  Info
1    0.000000  2001:5c0:1101:ed00:5571:5f81:e0a6:4978  2607:f140:ffff:fffe::3  DNS       Standard query SOA berkeley.edu

Frame 2: 507 bytes on wire (4056 bits), 507 bytes captured (4056 bits)
Linux cooked capture
Internet Protocol Version 6, Src: 2607:f140:ffff:fffe::3 (2607:f140:ffff:fffe::3), Dst: 2001:5c0:1101:ed00:5571:5f81:e0a6:4978 (2001:5c0:1101:ed00:557
User Datagram Protocol, Src Port: 53 (53), Dst Port: 58621 (58621)
Domain Name System (response)
    [Request In: 1]
    [Time: 0.192008000 seconds]
    Transaction ID: 0x0007
    Flags: 0x8500 (Standard query response, No error)
    Questions: 1
    Answer RRs: 1
    Authority RRs: 6
    Additional RRs: 10
    Queries
        berkeley.edu: type SOA, class IN
    Answers
        berkeley.edu: type SOA, class IN, mname ns-master1.berkeley.edu
    Authoritative nameservers
        berkeley.edu: type NS, class IN, ns phloem.uoregon.edu
        berkeley.edu: type NS, class IN, ns ucb-ns.NYU.edu
        berkeley.edu: type NS, class IN, ns aodns1.berkeley.edu
        berkeley.edu: type NS, class IN, ns adns1.berkeley.edu
        berkeley.edu: type NS, class IN, ns adns2.berkeley.edu
        berkeley.edu: type NS, class IN, ns ns.v6.berkeley.edu
    Additional records
        ns.v6.berkeley.edu: type A, class IN, addr 128.32.136.6
        ns.v6.berkeley.edu: type AAAA, class IN, addr 2607:f140:ffff:fffe::6
        adns1.berkeley.edu: type A, class IN, addr 128.32.136.3
        adns1.berkeley.edu: type AAAA, class IN, addr 2607:f140:ffff:fffe::3
        adns2.berkeley.edu: type A, class IN, addr 128.32.136.14
        adns2.berkeley.edu: type AAAA, class IN, addr 2607:f140:ffff:fffe::e
        aodns1.berkeley.edu: type A, class IN, addr 192.35.225.133
        aodns1.berkeley.edu: type AAAA, class IN, addr 2607:f010:3f8:8000:214:4fff:fe45:e6a2
        phloem.uoregon.edu: type A, class IN, addr 128.223.32.35
        phloem.uoregon.edu: type AAAA, class IN, addr 2001:468:d01:20::80df:2023

Figure 11-14 Response to a DNS query for an SOA record using IPv6. The response includes IPv4 and IPv6 addresses for the zone.

The question section contains the labels berkeley and edu and is 18 bytes
long. The answer section contains the relevant information for the berkeley.edu
domain described earlier and is able to take advantage of compression labels
thanks to the contents of the question section. The total length for this section is 58
bytes. The authority section contains six NS records identifying name servers. This
information takes another 135 bytes. The additional information section includes
five A records and five AAAA records for a total of 220 bytes. The size of the
RDATA field for each AAAA record is 16 bytes, so although the IPv6 address can
be written in textual form with the :: convention to save space, it is not encoded
this way in the packet. Instead, the full 128 bits of the address are used.
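
The byte counts quoted for this response can be cross-checked with a little arithmetic. The Python sketch below encodes the question section to confirm its 18-byte size, then adds the section sizes given in the text to the fixed 12-byte DNS header. The helper name is our own; the answer, authority, and additional sizes are taken from the text, not recomputed.

```python
import struct

SOA_TYPE = 6        # RR type code for SOA
DNS_HEADER = 12     # fixed-size DNS message header, in bytes


def encode_question(name, qtype, qclass=1):
    """Encode a question: length-prefixed labels, a zero byte, type, class."""
    out = b""
    for label in name.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00" + struct.pack("!HH", qtype, qclass)


question = encode_question("berkeley.edu", SOA_TYPE)
section_sizes = {
    "question": len(question),   # 1+8 + 1+3 + 1 + 2 + 2 bytes
    "answer": 58,
    "authority": 135,
    "additional": 220,
}
dns_message = DNS_HEADER + sum(section_sizes.values())
udp_payload = 8 + dns_message     # UDP header plus DNS message
ipv6_datagram = 40 + udp_payload  # IPv6 header plus its payload
```

The totals agree with the trace: a 443-byte DNS message, a Payload Length of 451, and a 491-byte IPv6 datagram.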

11.5.6.7 Mail Exchanger (MX) Records 

An MX record provides the name of a mail exchanger, a host willing to engage
in the Simple Mail Transfer Protocol (SMTP) [RFC5321] to receive incoming e-mail
on behalf of users associated with a domain name. When the Internet was still
developing, some sites did not have permanent connections but instead would
dial up and connect to hosts that did have permanent Internet connections. In
such scenarios, the e-mail destination might be disconnected from the network
when e-mail was in transit, so another host would hold on to the mail until the
destination was attached. This was one motivation for the inclusion of MX records
in the DNS: to allow sending hosts to deliver e-mail to an intermediary ("relay
server") even if the true destination was not available. Today, MX records are still
used, and mail agents prefer to deliver e-mail to the host(s) listed in an MX record
associated with a particular domain name.

MX records include a preference value, so that more than one MX record may
be present for a particular domain name. The preference value allows a sending
agent to sort the hosts in preference order (smaller is more preferable) in deciding
which host to use as an e-mail destination. For example, we can use the host
command again to query the DNS for MX records associated with the domain name
cs.ucla.edu:


Linux% host -t MX cs.ucla.edu ns3.dns.ucla.edu
Using domain server:
Name: ns3.dns.ucla.edu
Address: 2607:f600:8001:1::ff:fe01:35#53
Aliases:

cs.ucla.edu mail is handled by 13 Pelican.cs.ucla.edu.
cs.ucla.edu mail is handled by 3 Moa.cs.ucla.edu.
cs.ucla.edu mail is handled by 13 Mailman.cs.ucla.edu.

Here we can see that an e-mail addressed to person@cs.ucla.edu is handled
by one of three mail servers configured in the DNS. All of these mail servers
are part of the cs.ucla.edu domain, but in general mail servers do not have to be
named with the same domain as the e-mail they are handling. These three servers
can be grouped into two parts: one with preference 3 and one set with preference
13. The server with the smaller preference number is preferred, so the sender first
tries Moa.cs.ucla.edu. If that fails, it tries either Pelican or Mailman, selected
at random.
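
The ordering rule just described can be sketched in Python as follows. The function name is our own, and the records are supplied as plain (preference, target) tuples rather than being fetched from the DNS.

```python
import random


def mx_try_order(records, rng=random):
    """Return MX targets in the order a sender might try them.

    Lower preference values come first; targets that share a
    preference value are shuffled among themselves.
    """
    by_pref = {}
    for pref, target in records:
        by_pref.setdefault(pref, []).append(target)
    order = []
    for pref in sorted(by_pref):
        group = list(by_pref[pref])
        rng.shuffle(group)
        order.extend(group)
    return order
```

With the three cs.ucla.edu records, Moa always comes first, followed by Pelican and Mailman in either order.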

It is possible that none of the MX record target hosts is reachable. This is an
error condition. It is also possible that there are no MX records present, but there
are CNAME, A, or AAAA records for the domain name. If there is a CNAME
record, the target of the CNAME is used in place of the original domain name. If
there are A or AAAA records, the mail agent may connect to these addresses. Each
is considered to have a preference of zero (called implicit MX). MX record targets
must be domain names that resolve to A or AAAA records; they cannot point to
CNAMEs [RFC5321].
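
The fallback sequence in this paragraph can be written as a short decision procedure. In the Python sketch below, lookup is a hypothetical callback that returns the list of record values for a name and type (an empty list when there are none), and the limit on CNAME links is an assumption of the sketch.

```python
def delivery_targets(name, lookup, max_links=5):
    """Return (preference, host) pairs to try for mail addressed to name."""
    for _ in range(max_links + 1):
        mx = lookup(name, "MX")
        if mx:
            return sorted(mx)        # explicit MX records, best first
        cname = lookup(name, "CNAME")
        if cname:
            name = cname[0]          # restart using the canonical name
            continue
        if lookup(name, "A") or lookup(name, "AAAA"):
            return [(0, name)]       # implicit MX with preference zero
        return []                    # nothing usable: an error condition
    raise RuntimeError("CNAME chain too long")
```

An empty result corresponds to the case in which the sending agent has no host to contact for the domain.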

11.5.6.8 Fighting Spam: The Sender Policy Framework (SPF) and Text (TXT)
Records

For outgoing e-mail, MX records allow the DNS to help determine the names of
mail relays and servers for a domain. More recently, the DNS has been leveraged
by receiving mail agents to determine which relaying or sending mail servers are
authorized to send mail from a particular domain name. This is used to help
combat spam (unwanted e-mail) that is sent by a rogue mail agent pretending to be an
authorized mail sender.

E-mail received by a mail server is rejected, stored, or forwarded to another mail
server. Rejection can happen for a number of reasons, such as a protocol error or lack
of available storage space at the receiver. It can also be rejected because the sending
mail client does not appear to be the proper one for sending e-mail. This capability
is supported by the Sender Policy Framework (SPF) and documented in [RFC4408], an
experimental RFC. There is another framework known as Sender ID [RFC4406] that
incorporates SPF's functions. It is also experimental but less widely deployed.

Version 1 of SPF uses DNS TXT or SPF (type 99) resource records. Records are
set up and published in the DNS by a domain's owner to indicate which servers
are authorized to send mail originating from the domain. Although the SPF record
type is a more "proper" place to carry SPF-related information in some sense, some
DNS client implementations do not process SPF records properly, so to avoid this
complication TXT records are used. TXT records hold simple strings associated
with a domain name. Historically they have held strings useful for human
consumption, to aid in debugging or identifying the owner or location of a domain.
Today, they are usually processed by programs such as the SPF application.

SPF supports a rich syntax to express criteria used to match against details
about an incoming mail message and the connection in which it is carried. For
example, UC Berkeley uses the following SPF entry (some lines have been wrapped
for clarity):


Linux% host -t txt berkeley.edu
berkeley.edu descriptive text
    "v=spf1 ip4:169.229.218.128/25 ip6:2607:F140:0:1000::/64
     include:outboundmail.convio.net ~all"

In this example, the information being provided is for SPF version 1 (indicated
by the v=spf1 string in the version section) and uses a TXT RR. When a receiving
mail agent receives e-mail purportedly coming from the domain berkeley.edu,
it performs a DNS query for records of type TXT against the berkeley.edu
domain. The value of the text record contains the matching criteria (called
mechanisms) and other information (called modifiers). Preceding each mechanism is a
qualifier that determines the consequence of a matching mechanism. Processing
of SPF records takes place using a function called check_host(). The function
evaluates various mechanisms and completes when the first matching mechanism
is encountered. Ultimately, check_host() provides a return value that is one of
the following: None, Neutral, Pass, Fail, SoftFail, TempError, PermError. The None
and Neutral return values indicate that no information was available or that
information was available but that no result is asserted. These are handled identically.
Pass indicates a match, as described in the next paragraph. Fail indicates that the
sending host is not authorized to send mail from the domain. SoftFail is somewhat
ambiguous but is to be treated "somewhere between a 'Fail' and a 'Neutral,'"
according to [RFC4408]. The TempError return indicates some transient failure
(e.g., communication failure) that is likely to abate. The PermError return indicates
that there was a problem in the SPF configuration, usually due to a malformed
TXT or SPF record for the domain.

Reading from left to right in the example, the string v=spf1 is a modifier
indicating that the SPF version is 1. The ip4 mechanism specifies that the SMTP sender
has an IPv4 address from the prefix 169.229.218.128/25. The ip6 mechanism
specifies any sending host with IPv6 address prefix 2607:F140:0:1000::/64.
Finally, the include mechanism incorporates, by reference, the TXT records with
outboundmail.convio.net:

Linux% host -t txt outboundmail.convio.net
outboundmail.convio.net descriptive text
    "v=spf1 +ip4:66.45.103.0/25 +ip4:69.48.252.128/25
     +ip4:209.163.168.192/26 -all"
outboundmail.convio.net descriptive text
    "spf2.0/pra
     +ip4:66.45.103.0/25 +ip4:69.48.252.128/25
     +ip4:209.163.168.192/26 -all"

Note that these TXT records are used for both SPF and for Sender ID (which
uses the value of spf2.0/pra in the version section). The first record is used
by SPF. The + qualifier indicates that a match results in a Pass indication. Any
mechanism missing a qualifier is assumed to have the + qualifier. Other possible
qualifiers include - (Fail), ~ (SoftFail), and ? (Neutral). If none of the matching
mechanisms produces a Pass result, the final mechanism (all) matches any
condition. The tilde character (~) before the all criterion indicates that a SoftFail
return should be generated if all is the only matching mechanism. The exact way
a soft failure is handled is dependent on the receiving e-mail software. Note that
even with SPF support, validation is provided only on the sending domain and
system, and not on the sending user. In Chapter 18 we will look at DKIM, which
provides SPF-like capabilities but uses cryptography for authentication.
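
The evaluation order described in this section can be illustrated with a much-simplified check_host() in Python. This is a sketch, not the RFC 4408 algorithm: it reads records from a local table instead of the DNS, understands only the ip4, ip6, include, and all mechanisms, and reports any other term (including modifiers such as redirect) as PermError. The record strings are those shown in the examples above.

```python
import ipaddress

# Qualifier character -> result when the mechanism matches.
QUALIFIERS = {"+": "Pass", "-": "Fail", "~": "SoftFail", "?": "Neutral"}

# Stand-in for DNS TXT lookups (an assumption of this sketch).
TXT_RECORDS = {
    "berkeley.edu": "v=spf1 ip4:169.229.218.128/25 ip6:2607:F140:0:1000::/64 "
                    "include:outboundmail.convio.net ~all",
    "outboundmail.convio.net": "v=spf1 +ip4:66.45.103.0/25 "
                               "+ip4:69.48.252.128/25 "
                               "+ip4:209.163.168.192/26 -all",
}


def check_host(sender_ip, domain):
    """Return a simplified SPF result for sender_ip claiming domain."""
    record = TXT_RECORDS.get(domain)
    if record is None or not record.startswith("v=spf1"):
        return "None"
    ip = ipaddress.ip_address(sender_ip)
    for term in record.split()[1:]:
        qualifier = "+"                      # default when none is given
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        name, _, arg = term.partition(":")
        if name in ("ip4", "ip6"):
            net = ipaddress.ip_network(arg, strict=False)
            matched = ip.version == net.version and ip in net
        elif name == "include":
            # include matches only if the referenced record yields Pass.
            matched = check_host(sender_ip, arg) == "Pass"
        elif name == "all":
            matched = True
        else:
            return "PermError"
        if matched:
            return QUALIFIERS[qualifier]     # first match decides
    return "Neutral"
```

A sender inside 169.229.218.128/25 or one of the included convio.net prefixes yields Pass for berkeley.edu, while any other sender falls through to the all mechanism and yields SoftFail.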

11.5.6.9 Option (OPT) Pseudo-Records 

In conjunction with EDNS0, described previously, a special OPT pseudo-RR has
been defined [RFC2671]. It is "pseudo" in the sense that it pertains only to the
contents of a single DNS message and is not conventional DNS RR data. Consequently,
OPT RRs are not cached, forwarded, or persistently stored, and they may
appear only once (or not at all) in a DNS message. If one is present in a DNS
message, it is found in the additional information section.

An OPT RR contains a 10-byte fixed portion followed by a variable portion.
The fixed portion includes 16 bits indicating the RR type (41), 16 bits indicating the
UDP payload size, 32 bits constituting an extended RCODE field and flags area,
and 16 bits giving the size of the variable portion in bytes. These fields are located
in the same relative positions as the Type, Class, TTL, and RDLEN fields,
respectively, in a conventional RR (see Figure 11-8). OPT RRs use a null domain
name in the Name field (a single 0 byte). The extended RCODE and Flags area (32 bits,
corresponding to the TTL field in Figure 11-8) is subdivided into an 8-bit area to hold
an extra 8 high-order bits augmenting the RCODE field in Figure 11-3, and an 8-bit
Version field (currently set to 0 to indicate EDNS0). The remaining 16 bits are not
yet defined and must be 0. The additional 8 bits provide an extended set of
possible DNS error types, and these values are given in Table 11-4. (Note that value 16
is defined by two distinct RFCs.)
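
The layout just described maps directly onto a fixed-format structure. The Python sketch below packs an OPT pseudo-RR and shows how the extra 8 bits combine with the RCODE from the DNS header, which we take to be 4 bits wide. The function names and the default payload size of 4096 are assumptions made for illustration.

```python
import struct

OPT_TYPE = 41


def build_opt_rr(udp_payload_size=4096, ext_rcode=0, version=0, rdata=b""):
    """Pack an OPT pseudo-RR: null name, 10-byte fixed portion, then RDATA."""
    fixed = struct.pack(
        "!BHHBBHH",
        0,                  # Name: null domain name, a single zero byte
        OPT_TYPE,           # Type position
        udp_payload_size,   # Class position carries the UDP payload size
        ext_rcode,          # TTL position, first byte: extended RCODE bits
        version,            # TTL position, second byte: 0 for EDNS0
        0,                  # TTL position, remaining 16 bits: set to 0
        len(rdata),         # RDLEN: size of the variable portion
    )
    return fixed + rdata


def full_rcode(ext_rcode, header_rcode):
    """Combine the 8 high-order OPT bits with the 4-bit header RCODE."""
    return (ext_rcode << 4) | header_rcode
```

With no options present the record occupies 11 bytes (one for the name plus the 10-byte fixed portion), and an extended value of 1 with a header RCODE of 0 yields 16, the first value in Table 11-4.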


Table 11-4 Extended RCODE values. Most are used to support security extensions.

Value  Name     Reference  Description and Purpose
-----  -------  ---------  -------------------------------------------------
16     BADVERS  [RFC2671]  Bad EDNS version
16     BADSIG   [RFC2845]  Bad TSIG signature (see Chapter 18)
17     BADKEY   [RFC2845]  Bad TSIG key (see Chapter 18)
18     BADTIME  [RFC2845]  Bad TSIG signature (time problem; see Chapter 18)
19     BADMODE  [RFC2930]  Bad TKEY mode (see Chapter 18)
20     BADNAME  [RFC2930]  Duplicate key name (see Chapter 18)
21     BADALG   [RFC2930]  Algorithm not supported (see Chapter 18)


As we have mentioned, OPT RRs contain a variable-length RDATA field. This
field is used to hold an extensible list of attribute-value pairs. The current set of
attributes, meanings, and defining RFCs is maintained by the IANA [DNSPARAMS].
One such option, called NSID (EDNS option code 3) [RFC5001], indicates a
special identifying value for a responding DNS server. The format of this value is
not defined by standard but is instead configurable by the system administrator
of the DNS server. This capability may be useful in circumstances where an anycast
address is used to identify a group of servers. The NSID is able to identify a
specific responding server using a value other than the sending IP address. We
shall see more examples of OPT RRs and EDNS0 usage when we look at DNSSEC
in Chapter 18.
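
The OPT RR layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the advertised UDP payload size of 4096 is an arbitrary choice, and the NSID request is assumed to carry an empty option value.

```python
import struct

OPT_TYPE = 41  # RR type value for OPT, as given in the text


def edns_option(code, data=b""):
    """Encode one attribute-value pair for the variable (RDATA) portion."""
    return struct.pack("!HH", code, len(data)) + data


def build_opt_rr(udp_payload_size=4096, ext_rcode=0, version=0, options=b""):
    """Build an OPT RR: null name, then the 10-byte fixed portion, then options.

    The payload size occupies the Class position, and the extended RCODE,
    version, and 16 zero bits occupy the TTL position.
    """
    name = b"\x00"  # null (root) domain name
    ttl_area = (ext_rcode << 24) | (version << 16)  # low 16 bits left as 0
    fixed = struct.pack("!HHIH", OPT_TYPE, udp_payload_size, ttl_area, len(options))
    return name + fixed + options


def full_rcode(ext_rcode, header_rcode):
    """Combine the 8 high-order bits with the 4-bit RCODE from the DNS header."""
    return (ext_rcode << 4) | header_rcode


# An OPT RR carrying an (assumed empty) NSID option, EDNS option code 3.
opt = build_opt_rr(options=edns_option(3))
print(opt.hex())
print(full_rcode(1, 0))  # 16, which is BADVERS in Table 11-4
```

Because the extended bits are the high-order part of the error code, every value in Table 11-4 needs an OPT RR (or a TSIG/TKEY record) to be expressible at all; the 4-bit header field alone stops at 15.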

11.5.6.10 Service (SRV) Records 

[RFC2782] defines the service (SRV) resource record. SRV RRs generalize the MX
record format to describe the host, protocols, and port numbers used to contact a 
particular service. An SRV RR is ordinarily structured as follows: 

_Service._Proto.Name TTL IN SRV Prio Weight Port Target 

The Service identifier is the official name of a service. The Proto identifier
is the transport protocol used to access the service, usually TCP or UDP. The TTL
value is a conventional RR TTL, and IN and SRV indicate the Internet class and
SRV RR type, respectively. The Prio value is a 16-bit unsigned value and works
like the priority value in MX records (lower numbers represent higher priorities).
The Weight value is used to choose an RR among several whose priority values
are equal. The idea is that the weight is to be used as a weighted probability to
select the particular entry for load balancing, so larger weights indicate a greater
probability of selection. The Port is the TCP or UDP (or other transport protocol's)
port number. The Target is the domain name of the target host where the service
is being provided. The Name identifier is the containing domain in which a
particular service is to be found. One of the purposes of SRV records is to identify
when multiple individual servers in a domain support the same service.
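
The interaction of Prio and Weight can be illustrated with a small Python sketch. It is a simplification, not the exact selection procedure of [RFC2782]: it picks one record from the lowest-numbered priority group using the weights as relative probabilities. The SRV tuple type and the pick_srv function are hypothetical names introduced for this example.

```python
import random
from collections import namedtuple

# Hypothetical container for the four SRV RDATA fields described above.
SRV = namedtuple("SRV", "priority weight port target")


def pick_srv(records, rng=random):
    """Pick one SRV record: lowest priority value first, then weighted choice."""
    best = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == best]
    weights = [r.weight for r in candidates]
    if sum(weights) == 0:
        # All weights are zero: no load-balancing hint, so choose uniformly.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


records = [
    SRV(10, 60, 389, "ldap1.example.com."),
    SRV(10, 40, 389, "ldap2.example.com."),
    SRV(20, 0, 389, "backup.example.com."),
]
print(pick_srv(records).target)
```

With these made-up records, ldap1 would be chosen roughly 60% of the time and ldap2 roughly 40%, while the backup host would be considered only if the priority-10 servers were removed from the list.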

For example, if a client would like to determine the host and port where the
ldap service is available using the TCP protocol in the domain example.com,
it would perform a query for SRV records using the domain name
_ldap._tcp.example.com. Here is a real-world example:

Linux% host -t srv _ldap._tcp.openldap.org 

_ldap._tcp.openldap.org has SRV record 0 0 389 www.openldap.org. 

In this example, we are looking for a server providing the Lightweight Directory
Access Protocol (LDAP) [RFC4510] service over TCP within the domain
openldap.org. We find that it can be accessed at the server www.openldap.org
using TCP port 389 (the default LDAP port). The Priority and Weight values
are 0, as there are no alternative servers.

[RFC2782] did not specify a new IANA registry for SRV Service and Proto
values. So, by default, the names correspond to the names maintained in IANA's
"Service Name and Transport Protocol Port Number" registry [ISPR], and the Proto
values are either _tcp or _udp. There are a few exceptions, however. [RFC5509]
establishes conventions for SIP-based presence and instant messaging using the
following SRV Service and Proto names: _im._sip and _pres._sip. [RFC6186]
defines the following SRV Service names for e-mail user agents to easily discover
the contact information for IMAPS, SMTP, IMAP, and POP3 servers (the first two are
ordinarily preferred when setting up an e-mail client): _submission, _imap,
_imaps, _pop3, _pop3s. Although [RFC6186] doesn't require these names to use
TCP as the corresponding Proto value, this is currently the only real option. For
example, a user configuring a new mail user agent (MUA, essentially an e-mail
program) might specify only the domain example.com. The MUA implementation
would then likely perform DNS queries for at least _submission._tcp.example.com
and _imaps._tcp.example.com.
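
Forming the query names mentioned above is purely mechanical, as the following Python sketch suggests. The function name srv_query_name is an assumption made for this example and is not part of any library.

```python
def srv_query_name(service, proto, domain):
    """Form the owner name _Service._Proto.Name used for an SRV query."""
    return "_{}._{}.{}.".format(service, proto, domain.rstrip("."))


# Names an MUA configured with only example.com might look up.
for service in ("submission", "imaps"):
    print(srv_query_name(service, "tcp", "example.com"))
```

The leading underscores keep the service and protocol labels from colliding with ordinary host names in the same domain.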

11.5.6.11 Name Authority Pointer (NAPTR) Records 

The Name Authority Pointer (NAPTR) RR type is used when DNS supports a 
Dynamic Delegation Discovery System (DDDS) [RFC3401]. A DDDS is a general, 
abstract algorithm for applying dynamically retrieved string transformation rules 
to strings provided by applications and using the results, most often, for locating 
resources. Each DDDS application customizes the operation of the general DDDS 
rules for its particular use case. A DDDS includes a rules database and a set of 
algorithms for forming strings that are used with the database to produce output 
strings. DNS is one such database [RFC3403], and with it the NAPTR resource 
record type is used to hold the transformation rules. One such DDDS application 
has been defined for use with DNS to handle multinational telephone numbers 
and convert them to a standard Uniform Resource Identifier (URI) format [RFC3986] 
using ENUM (see Section 11.5.6.12). 

In a DDDS, an algorithm [RFC3402] directs how an application-unique string
(AUS) is processed by rules contained in a database. The result can be either a
terminal string (complete output) or another (nonterminal) string used to retrieve
another rule that is applied to the AUS. In all, the collection forms a powerful
string rewriting system that can be used to encode nearly anything that has a
sufficiently regular syntax. The essence of this algorithm is captured in Figure 11-15.

The process illustrated in Figure 11-15 starts by applying the first Well-Known
Rule to the AUS, which is uniquely identified for each application. The result forms
a key used to retrieve another rule from a database. Rules are string-rewriting
patterns and flags that are applied to the AUS, but never to the result of a rewritten
string. The particular way this works is dependent on the application, but usually
the rules are regular expression substitutions, similar to those used with the
UNIX sed program [DR97]. When using the DNS as a database for supporting a
DDDS [RFC3403], the case in which we are interested, the keys are domain names
and the rules are stored in NAPTR resource records. Each NAPTR RR contains
the following fields: Order, Preference, Flags, Services, Regular Expression (sometimes
abbreviated Regexp), and Replacement.

The Order field is a 16-bit unsigned integer specifying which NAPTR record to
use before others (lower numbers are preferred to higher ones), as the DNS
architecture does not guarantee the ordering of any particular set of resource records.
The Preference field is used to influence the order of records containing the same
order number. The Order field is supposed to place a mandatory ordering on RRs,
whereas the preference number is advisory. The Flags field contains an unordered
list of single characters from the set A-Z and 0-9 (case-insensitive). The
particular application using NAPTR records (e.g., ENUM, described in the next
section) defines the interpretation of the Flags field. The Services field is defined
by the application to indicate which type of service is being described. The Regular
Expression field contains a substitution expression that is applied to the AUS
to form the identity of another server to use for another NAPTR lookup
(nonterminal case) or the output string (terminal case). The Replacement field (which
exists only when the Regular Expression does not) indicates the next server to query
for NAPTR records. It is encoded as a separate FQDN (no name compression is
used within the DNS message). The uses for these two final (mutually exclusive)
fields are very similar for historical reasons in the development of the NAPTR RR.

Figure 11-15 Abstract operation of the DDDS algorithm. Non-terminal records are permitted to form
loops. Each iteration involves a string rewrite operation on the application's unique
string.

To get a better sense of how NAPTR processing works with applications, we 
will have a brief look at the ENUM and SIP DDDS applications, the URI/URN 
DDDS applications, and alternatives for regular NAPTR records called S-NAPTR 
and U-NAPTR. Specifying a DDDS entails specifying the application's AUS, first 
Well-Known Rule, expected output, valid databases, flags, and service parameters. 

11.5.6.12 ENUM and SIP 

In the ENUM DDDS [R06][RFC6116][RFC6117][RFC5483], which is used to map
phone numbers to URI information, the AUS is an E.164-format telephone number
(up to 15 digits starting with the + character). The initial + character differentiates
E.164 numbers acceptable for use with the ENUM DDDS from numbers in other
name spaces. The first Well-Known Rule starts by removing any dashes or other
non-digit characters in the AUS. The DDDS database is the DNS, where keys are
domain names created from the AUS (which now consists only of digits) as
follows: dot (.) characters are inserted between each digit and the result is reversed.
Then, the suffix .e164.arpa is added. For example, the E.164 number +1-415-
555-1212 would be transformed to the key 2.1.2.1.5.5.5.5.1.4.1.e164.arpa.
The resulting domain name is used to query for NAPTR records.
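
The key construction just described can be written as a short Python function. The name e164_to_enum_domain is hypothetical, and the sketch assumes the caller passes a syntactically valid E.164 number.

```python
def e164_to_enum_domain(number, suffix="e164.arpa"):
    """Map an E.164 number to the domain name used to query for NAPTR records."""
    # Drop the leading + and any dashes or other non-digit characters.
    digits = [c for c in number if c in "0123456789"]
    # Reverse the digits, separate them with dots, and append the suffix.
    return ".".join(reversed(digits)) + "." + suffix


print(e164_to_enum_domain("+1-415-555-1212"))
# 2.1.2.1.5.5.5.5.1.4.1.e164.arpa
```

Reversing the digits places the country code closest to e164.arpa, so delegation in the DNS can follow the same hierarchy as the telephone numbering plan.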

The final output, possibly after multiple loops of the DDDS algorithm shown 
in Figure 11-15, is an absolute (not relative) URI. The only flag defined is the U 
flag, indicating a terminal rule that produces a URI. The lack of any flag indicates 
a non-terminal rule, sometimes called a non-terminal NAPTR (NTN). The service 
parameters, encoded in the Service field of the NAPTR record, are of the form 
E2U+Service, which derives from the string E2U (an indicator for E.164 to URI)
plus a Service name subfield providing information about particular services
associated with the number. Together, they form an enumservice identifier, and such
services are registered with the IANA [ENUM] [RFC6117]. Many have been created,
including enumservices for fax, instant messaging, and presence indicators. 

To see how this all works, we can construct a query for the number 
+420738511111 at the University of Ostrava in the Czech Republic (lines are 
wrapped for clarity): 

Linux% host -t naptr 1.1.1.1.1.5.8.3.7.0.2.4.e164.arpa
1.1.1.1.1.5.8.3.7.0.2.4.e164.arpa has NAPTR record
    50 50 "u" "E2U+sip" "!^\\+(.*)$!sip:\\1@osu.cz!" .
1.1.1.1.1.5.8.3.7.0.2.4.e164.arpa has NAPTR record
    100 50 "u" "E2U+sip" "!^\\+(.*)$!sip:\\1@cesnet.cz!" .
1.1.1.1.1.5.8.3.7.0.2.4.e164.arpa has NAPTR record
    200 50 "u" "E2U+h323" "!^\\+(.*)$!h323:\\1@gk1ext.cesnet.cz!" .

Here we see the contents of three NAPTR records in the ENUM DDDS application,
two for the SIP service and one for the H.323 service, used for Internet telephony.
The order numbers are 50 and 100 for the SIP entries and 200 for the H.323 entry,
showing how it is possible using ENUM and NAPTR records to have multiple
services associated with a single telephone number, and how the provider of the
NAPTR records can indicate a preferred ordering of more than one gateway
providing the same service.


Note 

SIP is an IETF-specified protocol used for signaling and is especially popular for
facilitating the connection of multimedia clients and servers. H.323 is an ITU- 
specified protocol for multimedia conferencing and communication, including a 
signaling sub-protocol. It is widely implemented in teleconferencing equipment. 
In this example and those that follow, the host program produces output that can 
be used as input to a zone file for a DNS server such as BIND. As a consequence, 
the output shows extra escape “\” characters (which appear as “\\”) that are not
present in the actual DNS responses provided by the server. 


To better understand how a NAPTR record's rules are applied to the AUS,
we will look at the second SIP record from the preceding example. After the DNS
query is performed and the NAPTR RR is received, the Regexp field is split at its
! delimiter characters: the string between the first and second ! is the regular
expression to match, and the string between the second and third ! is the
replacement. Thus, the string +420738511111 is matched against the regular expression
^\+(.*)$. According to the matching rules for regular expressions, the match is
successful, so the string rewrite rule becomes sip:\1@cesnet.cz. The special
variable \1 is replaced with the substring matching the first regular expression
contained in parenthesis characters, (), which in this case is everything in the
AUS except for the initial + character. In summary, the AUS +420738511111 is
transformed into the URI sip:420738511111@cesnet.cz.
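
The rewrite step can be approximated with Python's re module, as in the sketch below. Several assumptions apply: Python regular expressions stand in for the POSIX extended expressions that NAPTR specifies, the delimiter is assumed not to appear inside the expression or replacement, and the function name apply_naptr_regexp is invented for this example.

```python
import re


def apply_naptr_regexp(field, aus):
    """Apply a NAPTR Regexp field such as '!^\\+(.*)$!sip:\\1@cesnet.cz!' to an AUS.

    Returns the rewritten string, or None if the expression does not match.
    """
    delim = field[0]  # the first character defines the delimiter
    parts = field[1:].split(delim)
    if len(parts) != 3:
        raise ValueError("expected <delim>regex<delim>replacement<delim>flags")
    pattern, replacement, flags = parts
    match = re.search(pattern, aus, re.IGNORECASE if "i" in flags else 0)
    if match is None:
        return None
    # \1, \2, ... in the replacement refer to parenthesized groups.
    return match.expand(replacement)


# Field contents as carried in the DNS (without the extra zone-file escapes).
print(apply_naptr_regexp(r"!^\+(.*)$!sip:\1@cesnet.cz!", "+420738511111"))
# sip:420738511111@cesnet.cz
```

Note that the rule is always applied to the original AUS, never to the output of an earlier rewrite, so an application keeps the AUS unchanged across iterations of the algorithm.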

Once this URI is formed, the natural next step is for the driving application to
contact a SIP server. However, SIP can itself be carried over different transport
protocols, so the next step uses another DDDS that is tailored for SIP [RFC3263]. In this
application, NAPTR records contain targets that identify the domain that should
be used to perform SRV record queries. Continuing with the preceding example:

Linux% host -t naptr cesnet.cz
cesnet.cz has NAPTR record 200 50 "s" "SIP+D2T" "" _sip._tcp.cesnet.cz.
cesnet.cz has NAPTR record 100 50 "s" "SIP+D2U" "" _sip._udp.cesnet.cz.

Here we see the use of the s flag in the NAPTR, indicating that an SRV record
is the result. The Regexp field is not used, so the result is a simple domain name
substitution, given by the string in the Replacement field. The Service field is of the
form SIP+D2x or SIPS+D2x where SIP and SIPS indicate the use of the SIP
protocol and SIP protocol with security (TLS; see Chapter 18), respectively, and x is
the single-letter identifier of the transport protocol: U for UDP, T for TCP, and S for
SCTP [RFC4960]. In this example, the application would first attempt to look up
and use the SRV record corresponding to SIP/UDP and would resort to SIP/TCP
if that fails because the UDP entry has a lower order value.
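
A client's handling of these two records might look like the following Python sketch. The NAPTR tuple type and both function names are hypothetical; the records are copied from the example output above.

```python
from collections import namedtuple

NAPTR = namedtuple("NAPTR", "order preference flags service regexp replacement")

TRANSPORTS = {"U": "UDP", "T": "TCP", "S": "SCTP"}


def sip_transport(service):
    """Split a Service field such as 'SIP+D2U' into (protocol, transport)."""
    proto, _, resolution = service.upper().partition("+")
    if proto not in ("SIP", "SIPS") or not resolution.startswith("D2"):
        raise ValueError("not a SIP service field: " + service)
    return proto, TRANSPORTS[resolution[2:]]


def srv_lookup_order(records):
    """Return terminal 's' records sorted by Order, then Preference."""
    usable = [r for r in records if r.flags.lower() == "s"]
    return sorted(usable, key=lambda r: (r.order, r.preference))


records = [
    NAPTR(200, 50, "s", "SIP+D2T", "", "_sip._tcp.cesnet.cz."),
    NAPTR(100, 50, "s", "SIP+D2U", "", "_sip._udp.cesnet.cz."),
]
for r in srv_lookup_order(records):
    print(sip_transport(r.service), r.replacement)
```

Sorting by Order before Preference reflects the fact that Order is mandatory while Preference is only advisory among records sharing the same order number.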





11.5.6.13 URI/URN Resolution 

Although ENUM may be the most mature use of NAPTR records in the DNS, there
are also DDDS applications defined for resolving URIs [RFC3404] and for persistent,
location-independent URIs called Uniform Resource Names (URNs) [RFC2141].
All URIs (including URNs) consist of a scheme name followed by a substring
compliant with semantics that are specific to the scheme. The current list of official
schemes is maintained by the IANA [URI]. The URI and URN applications are so
similar that it is worth considering them together. For the URI/URN DDDS
application, then, the AUS is the URI or URN for which an authoritative "resolution"
server is being located. The first Well-Known Rule for the URI application is
simply the scheme name. For URNs, it is the name space identifier (the substring that
appears after the urn: scheme identifier and before the next colon character). For
example, http://www.pearson.com is a URI using the scheme (key) http, and
the URN urn:foo:foospace would use foo as the first key. Four possible flags
are currently defined: S, A, U, and P. The first three are terminal and indicate that
the result is the domain name to use for fetching an SRV record, an IP address, or
a URI, respectively. The P flag indicates that processing of the DDDS algorithm is
to be discontinued and some application-specific processing (defined elsewhere)
begins. All such flags are mutually exclusive. As with ENUM, the lack of any flag
indicates an NTN.
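
The first Well-Known Rule for the two applications can be sketched as follows in Python. The function name first_key is hypothetical, and folding the key to lowercase is an assumption made here for convenience rather than something stated in the text.

```python
def first_key(aus):
    """Derive the first DDDS key from a URI or URN.

    For a URI the key is the scheme name; for a URN it is the name space
    identifier that follows the urn: scheme identifier.
    """
    scheme, _, rest = aus.partition(":")
    if scheme.lower() == "urn":
        nid, _, _ = rest.partition(":")
        return nid.lower()
    return scheme.lower()


print(first_key("http://www.pearson.com"))  # http
print(first_key("urn:foo:foospace"))        # foo
```

The key is then looked up beneath the domain that holds the rules for that application, so http would be queried as http.uri.arpa and foo as foo.urn.arpa.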

Support for the URI/URN DDDS is still evolving. If we take a current (2011)
look into the DNS, we can see how some of the schemes have been populated into
the uri.arpa TLD:

Linux% host -t naptr http.uri.arpa
http.uri.arpa has NAPTR record 0 0 "" "" "!^http://([^:/?#]*).*$!\\1!i" .
Linux% host -t naptr ftp.uri.arpa
ftp.uri.arpa has NAPTR record 0 0 "" "" "!^ftp://([^:/?#]*).*$!\\1!i" .
Linux% host -t naptr mailto.uri.arpa
mailto.uri.arpa has NAPTR record 0 0 "" "" "!^mailto:(.*)@(.*)$!\\2!i" .
Linux% host -t naptr urn.uri.arpa
urn.uri.arpa has NAPTR record 0 0 "" "" "/urn:([^:]+)/\\1/i" .

The first three of these NAPTR records contain rewrite rules and no flags.
Thus, they essentially indicate that the application should extract the domain
name from the corresponding URI and continue the DDDS algorithm. The trailing
i flag after the last ! character indicates that case checking is to be performed
in an insensitive way. For example, mAiLto:person@example.com is rewritten
to be just example.com. The fourth record is used to extract the URN name space
ID and continue processing. For URNs, there are a small number (two at present)
of NAPTR records in the DNS set up in urn.arpa:

Linux% host -t naptr pin.urn.arpa 

pin.urn.arpa has NAPTR record 100 100 "" "" "" pin.verisignlabs.com. 

Linux% host -t naptr uci.urn.arpa 

uci.urn.arpa has NAPTR record 100 100 "" "" "" uci.or.kr.

These URN name spaces appear to be receiving little attention at present, and
it is still unclear to what extent URNs will be widely used, as there are now
competing methods for expressing and locating objects using persistent identifiers
(e.g., see [P10]). Nevertheless, more than 40 URN name spaces have been defined
[URN], so there continues to be community interest in establishing name spaces,
even though few have corresponding global, active NAPTR records.

11.5.6.14 S-NAPTR and U-NAPTR 

A common issue arises when an application wishes to determine the particular
host, protocol, and port number to use for reaching a service within a domain.
For example, a mail-reading application running on a user's computer in the
example.com domain may need to find a server offering the IMAP service. A
convention has arisen to simply prepend the service name to the domain (e.g.,
imap.example.com). Using CNAME, A, or AAAA records is somewhat inflexible,
because these record types do not convey any indication of which transport
protocol or port number to use. SRV records go further by providing another layer
of indirection, but their targets may contain only domain names for which an A or
AAAA record is subsequently retrieved. Using NAPTR records instead provides
more flexibility through an additional layer of indirection and allows for other
target record types (such as SRV records) to be used.

The NAPTR structure and rewrite capabilities have caused concern for some
implementers and operators given the complexity of the regular expressions. In
an effort to simplify the situation yet still provide a method beyond basic SRV
records for locating services, straightforward NAPTR (S-NAPTR) [RFC3958]
specifies a DDDS application for mapping domain "labels" that contain a service name
using certain simplifying restrictions on the contents of the NAPTR records.

For S-NAPTR, the AUS is a domain label for which an authoritative server for
a particular service is sought. The first Well-Known Rule is the identity function.
The expected output is the information necessary to contact a particular
application service within a domain (e.g., protocol, host, port). Only S and A terminal
flags are permitted, which indicate an SRV RR or a domain name (which is to
be used to form a subsequent request for an A or AAAA RR), respectively. The
service parameters are taken from a set maintained in an IANA registry [SNP],
and the Regexp field is not used. Only the Replacement field is active. S-NAPTR is
used in conjunction with the Internet Registry Information Service (IRIS) [RFC3981],
an XML-based text application protocol for exchanging information pertaining
to domain name and other registration information whose database is contained
within the iris.arpa portion of the DNS name space; for example:

Linux% host -t naptr areg.iris.arpa 

areg.iris.arpa NAPTR
    100 10 "" "AREG1:iris.xpc:iris.lwz" "" areg.nro.net.

This example uses S-NAPTR (no regular expression) to indicate that in order to
perform an IRIS query for AREG1-type data (see [RFC4698]), a subsequent NAPTR
query should be initiated to areg.nro.net.

Experience and further consideration of S-NAPTR led to the development of
URI-enabled NAPTR (U-NAPTR) [RFC4848], which relaxes some of the restrictions
of S-NAPTR but maintains all of its other features and registries. Most important,
an additional U flag is permitted, which enables the NAPTR record target to be
a URI and thus allows the use of regular expressions. This is similar to the fully
generic version of NAPTR, except U-NAPTR regular expressions are restricted to
the following form: !.*!<URI>!. That is, the entire AUS is replaced with a URI.
U-NAPTR is being used in conjunction with the Location-to-Service Translation
protocol (LoST) [RFC5222], which can be used to determine the correct service given a
point of network attachment and geographical location. Such information is useful
in public safety applications where geography dictates the particular jurisdiction
and responsible parties that should provide emergency services.

11.5.7 Dynamic Updates (DNS UPDATE) 

A zone can be updated dynamically using a protocol called DNS UPDATE, which is
defined in [RFC2136]. It supports the ability to specify prerequisites in conjunction
with an update request. Prerequisites are evaluated at the server; if they are not
true, the update is not performed and an error message is returned.

DNS UPDATE is accomplished by sending dynamic update DNS messages to an
authoritative DNS server for a zone. The structure of such messages is the same as
for a conventional DNS message, except the header fields and sections have different
names (see Figure 11-3). The sections indicate the zone being updated,
prerequisites that require various RRs to be present (or not) for the update to take effect,
and the update information. In an update, the header mirrors the format for a query,
but the Opcode field is set to Update (5). The header fields ZOCOUNT, PRCOUNT,
UPCOUNT, and ADCOUNT contain counts of the following: zones to be updated
(this will have the value 1), prerequisites to consider, updates to be made, and
additional information records, respectively. [RFC2136] also defines a collection of
RCODE values carried in DNS response messages capable of indicating conditions
relating to problems with the prerequisites or server (values 6-10 in Table 11-2).
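
As a rough illustration of the header layout, the following Python sketch packs the 12-byte header of an update message. The function name update_header is hypothetical; the field order follows the conventional DNS header, with the four count fields renamed as described above.

```python
import struct

OPCODE_UPDATE = 5


def update_header(txid, zocount=1, prcount=0, upcount=0, adcount=0):
    """Pack a DNS UPDATE request header.

    The Opcode occupies bits 11-14 of the 16-bit flags word; all other
    flag bits are left at 0 for a request.
    """
    flags = OPCODE_UPDATE << 11  # 0x2800
    return struct.pack("!6H", txid, flags, zocount, prcount, upcount, adcount)


# One zone, one prerequisite, four updates, no additional records.
print(update_header(0x4089, 1, 1, 4, 0).hex())
# 408928000001000100040000
```

The flags value 0x2800 produced here is what a packet analyzer would display for an update request with no other flag bits set.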

The zone section of an update message (see Figure 11-7) indicates the zone's 
name, a type, and a class. The type value will be 6 to indicate the presence of an 
SOA record, which identifies the zone. The class value will be 1 (Internet) for any 
update message with which we are concerned. All records being updated must be 
in the same zone. 

The prerequisite section of an update message contains one or more
prerequisites, expressed using the format for RRs we discussed previously in Section
11.5.5. There are five types of prerequisites: RRSet exists (value-dependent and
value-independent varieties), RRSet does not exist, name is in use, and name is not in
use. Recall that an RRSet is a group of RRs from the same zone sharing a common
name, class, and type. To express the semantics of a prerequisite, a combination of
an RR's class, type, and RDATA values is set according to Table 11-5.

Table 11-5 RR Class and Type fields used in prerequisite section to indicate prerequisite type

Prerequisite Type (Semantics)     Class Setting         Type Setting        RDATA Setting
RRSet exists (value-independent)  ANY                   Type being checked  Empty
RRSet exists (value-dependent)    Same as zone's class  Type being checked  RRSet being checked
RRSet does not exist              NONE                  Type being checked  Empty
Name is in use                    ANY                   ANY                 Empty
Name is not in use                NONE                  ANY                 Empty

The RRSet exists type means that at least one RRSet exists in the zone specified
in the zone section that matches the name and type of the corresponding RR in
the prerequisite section. In the value-dependent case, the prerequisite is true only
if the matching RRs also contain matching RDATA values. The RRSet does not exist
type means that no RRSet in the zone specified in the zone section matches the
name and type of the RR in the prerequisite section. The last two cases (Name is
in use and Name is not in use) refer only to the domain name; the type value is not
used. The values for NONE and ANY as DNS classes are 254 and 255, respectively.
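
The prerequisite encodings in Table 11-5 can be sketched in Python as shown below. All function names are hypothetical, name compression is not attempted, and the value-dependent case is omitted for brevity; treat this as an illustration of the class and type settings rather than a complete encoder.

```python
import struct

CLASS_NONE, CLASS_ANY = 254, 255
TYPE_CNAME, TYPE_ANY = 5, 255


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending with a 0 byte."""
    labels = [label for label in name.split(".") if label]
    body = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels)
    return body + b"\x00"


def prereq_rr(name, rrtype, rrclass):
    """A prerequisite RR with zero TTL and empty RDATA."""
    return encode_name(name) + struct.pack("!HHIH", rrtype, rrclass, 0, 0)


def rrset_exists(name, rrtype):          # value-independent variety
    return prereq_rr(name, rrtype, CLASS_ANY)


def rrset_does_not_exist(name, rrtype):
    return prereq_rr(name, rrtype, CLASS_NONE)


def name_in_use(name):
    return prereq_rr(name, TYPE_ANY, CLASS_ANY)


def name_not_in_use(name):
    return prereq_rr(name, TYPE_ANY, CLASS_NONE)


# "No CNAME RRSet exists for vista.dyn.home"
print(rrset_does_not_exist("vista.dyn.home", TYPE_CNAME).hex())
```

Requiring that no CNAME exists before adding address records is a sensible guard, because a name that owns a CNAME record is not supposed to own other data.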

Following the prerequisite section, the update section contains RRs to be
added or deleted from the zone specified in the zone section. There are four types
of updates, encoded as an RR with various combinations of values in the Class,
Type, and RDATA fields, as indicated in Table 11-6.


Table 11-6 RR Class and Type fields used in Update section to indicate update type

Use                            Class Setting         Type Setting              RDATA
Add RR to RRSet                Same as zone's class  Type of RR being added    RDATA of RR being added
Delete RRSet                   ANY                   Type of RRSet to delete   Empty (TTL and RDLENGTH also zero)
Delete all RRSets from a name  ANY                   ANY                       Empty (TTL and RDLENGTH also zero)
Delete RR from RRSet           NONE                  Type of RR being deleted  Matching RDATA to delete

The update section contains a collection of RRs that are processed provided no
errors have occurred due to prerequisites or server problems. Each RR encodes an
addition or deletion operation. Modifications can be performed as a deletion
followed by an addition. To see an example of DNS UPDATE, we can induce a
Windows machine to perform a dynamic DNS update using the following command:


C:\> ipconfig /registerdns 


Windows clients issue updates for their computer name and domain name by
default, but this behavior can also be enabled for IPv4 on a per-DNS-suffix basis
by checking the box labeled "Use this connection's DNS suffix in DNS registration"
under the DNS section of the Advanced TCP/IP Settings, found on the General tab
of the Internet Protocol (TCP/IP) Properties menu associated with each interface
enabled for TCP/IP. For IPv6, the same procedure is used, but on the IPv6 Properties
menu. In the example shown in Figure 11-16, we can see how the machine named
vista updates the local zone dyn.home as it issues the DNS update message shown.


[Wireshark capture dns-dyn.tr, frame 1 selected]

No.  Time      Source     Destination  Protocol  Info
1    0.000000  10.0.0.57  10.0.0.1     DNS       Dynamic update SOA dyn.home
2    0.001059  10.0.0.1   10.0.0.57    DNS       Dynamic update response

Frame 1: 162 bytes on wire (1296 bits), 162 bytes captured (1296 bits)
Ethernet II, Src: 00:13:02:20:b9:18, Dst: 00:04:5a:9f:9e:80
Internet Protocol, Src: 10.0.0.57, Dst: 10.0.0.1
User Datagram Protocol, Src Port: 58973, Dst Port: 53
Domain Name System (query)
    [Response In: 2]
    Transaction ID: 0x4089
    Flags: 0x2800 (Dynamic update)
    Zones: 1
    Prerequisites: 1
    Updates: 4
    Additional RRs: 0
    Zone
        dyn.home: type SOA, class IN
    Prerequisites
        vista.dyn.home: type CNAME, class NONE
            Name: vista.dyn.home
            Type: CNAME (canonical name for an alias)
            Class: NONE (0x00fe)
            Time to live: 0 time
            Data length: 0
    Updates
        vista.dyn.home: type AAAA, class ANY
        vista.dyn.home: type A, class ANY
        vista.dyn.home: type AAAA, class IN, addr 2001:5c0:1101:ed00:fd26:de93:5ab7:405a
        vista.dyn.home: type A, class IN, addr 10.0.0.57

Figure 11-16 A DNS dynamic update contains an SOA record in the zone section and RRs in
the update section. This case includes new IPv4 and IPv6 addresses for the host
vista.dyn.home.

Figure 11-16 shows how a dynamic update is encoded. The DNS server at
10.0.0.1 (running BIND9 [AL06] in this example) is configured to allow dynamic
updates. The zone section contains an SOA record identifying the zone to be
updated (dyn.home). The prerequisite section contains an RR with a zero-length
RDATA section and 0 TTL value. The RR corresponds to the type of
prerequisite in the third row of Table 11-5 (RRSet does not exist) because its type is not
ANY (it is CNAME) and its class is set to NONE (254).

In this particular case, the addresses 10.0.0.57 and 2001:5c0:1101:ed00:fd26:
de93:5ab7:405a are to be associated with the name vista.dyn.home. This is
accomplished by first deleting the AAAA and A RRSets (corresponding to row 2
in Table 11-6), and then adding the AAAA and A RRSets (corresponding to row 1
in Table 11-6) for the desired addresses.

[Wireshark capture dns-dyn.tr, frame 2 selected]

No.  Time      Source     Destination  Protocol  Info
1    0.000000  10.0.0.57  10.0.0.1     DNS       Dynamic update SOA dyn.home
2    0.001059  10.0.0.1   10.0.0.57    DNS       Dynamic update response

Frame 2: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Ethernet II, Src: 00:04:5a:9f:9e:80, Dst: 00:13:02:20:b9:18
Internet Protocol, Src: 10.0.0.1, Dst: 10.0.0.57
User Datagram Protocol, Src Port: 53, Dst Port: 58973
Domain Name System (response)
    [Request In: 1]
    [Time: 0.001059000 seconds]
    Transaction ID: 0x4089
    Flags: 0xa880 (Dynamic update response, no error)
        1... .... .... .... = Response: Message is a response
        .010 1... .... .... = Opcode: Dynamic update (5)
        .... .0.. .... .... = Authoritative: Server is not an authority for domain
        .... ..0. .... .... = Truncated: Message is not truncated
        .... ...0 .... .... = Recursion desired: Don't do query recursively
        .... .... 1... .... = Recursion available: Server can do recursive queries
        .... .... .0.. .... = Z: reserved (0)
        .... .... ..0. .... = Answer authenticated: Answer/authority portion was not authenticated by the server
        .... .... ...0 .... = Non-authenticated data: Unacceptable
        .... .... .... 0000 = Reply code: No error (0)
    Zones: 0
    Prerequisites: 0
    Updates: 0
    Additional RRs: 0

Figure 11-17 The response to a dynamic update request includes a transaction ID and status flag set.


Responses to DNS updates are straightforward and compact. The response for
the update shown in Figure 11-16 is illustrated in Figure 11-17.

The Flags field indicates a successful update (no error). The transaction ID
(0x4089) is used to ensure that the update response matches a corresponding
request. Note that on Linux, the nsupdate program can be used to update a
cooperative DNS server. DNS servers cooperate with a requested update only if an
authentication and access control procedure indicates that the request is acceptable.
This procedure can be as simple as no check at all or a list of permitted client IP
addresses kept at the server, neither of which is very secure, or it can use somewhat
more complex and secure methods that provide transaction authentication (see TSIG
and SIG(0) in Chapter 18).
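
A client checking an update response might decode the flags word and compare transaction IDs roughly as in the following Python sketch. The function names are hypothetical; the bit positions are those of the standard DNS header.

```python
def decode_dns_flags(flags):
    """Split the 16-bit DNS header flags word into a few named fields."""
    return {
        "response": bool(flags & 0x8000),
        "opcode": (flags >> 11) & 0xF,
        "authoritative": bool(flags & 0x0400),
        "truncated": bool(flags & 0x0200),
        "recursion_desired": bool(flags & 0x0100),
        "recursion_available": bool(flags & 0x0080),
        "rcode": flags & 0xF,
    }


def update_succeeded(request_id, response_id, response_flags):
    """True if the response matches the request and reports no error."""
    f = decode_dns_flags(response_flags)
    return (
        request_id == response_id
        and f["response"]
        and f["opcode"] == 5      # Update
        and f["rcode"] == 0       # no error
    )


print(decode_dns_flags(0xA880))
print(update_succeeded(0x4089, 0x4089, 0xA880))  # True
```

A nonzero rcode in the range 6-10 would instead point to a failed prerequisite or a server-side refusal, as noted earlier.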


11.5.8 Zone Transfers and DNS NOTIFY 

A zone transfer is used to copy a set of RRs for a zone from one server to another
(generally from the master server to slave servers). The purpose of doing so is to
keep multiple servers in sync with respect to a zone's contents. Multiple servers
provide resiliency to failure, in case a server should go down. Performance can
also be improved as multiple servers can be used to share the processing load for
incoming queries. Finally, the latency of a DNS query/response can potentially be
reduced if servers are placed in locations close to clients (i.e., where the network
latency between resolver and server is small).

As originally specified, zone transfers are initiated after polling, where slaves
periodically contact masters to see if a zone transfer is necessary by comparing the
zones' version numbers. A later method, called DNS NOTIFY, uses an asynchronous
update mechanism to indicate that a zone transfer needs to be initiated when the
zone contents change. Once a zone transfer is initiated, either the entire
zone is transferred (using DNS AXFR messages) [RFC5936], or an incremental zone
transfer option may be used (using DNS IXFR messages) [RFC1995]. The general
scheme operates according to the illustration in Figure 11-18.


[Figure 11-18 diagram: a message exchange between the Slave/Secondary and the
Master/Primary. The labels in the diagram are: notify (optional), SOA Query,
SOA Response, Check Serial Number, Zone contents, and Produce Update and Record
Results for IXFR Requests.]


Figure 11-18 A DNS zone transfer copies the contents of zones between servers. An optional
notification can cause a slave to request a full or incremental zone transfer.


We will now have a closer look at each of the options, including full and
incremental zone transfers, plus DNS NOTIFY.

11.5.8.1 Full Zone Transfers (AXFR Messages) 

Full zone transfers are controlled by the zone transfer parameters carried in a zone's 
SOA record: primary name server, serial number, and the refresh, retry, and expire 
intervals. When configured, a slave server attempts to contact the primary server to 
see if a zone transfer is necessary. Contacts are attempted periodically, according to 
the refresh interval. They are also attempted when a server first starts. If a contact 
is not successful (no response from the server), retries are attempted periodically 
according to the retry interval (generally shorter than the refresh interval). The 
entire zone contents are flushed if not refreshed within the expire interval,
effectively incapacitating the server for the zone.
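
The timer rules above can be summarized as a small decision function. The sketch below is an illustration under simplifying assumptions and does not reflect the logic of any particular server: next_action and transfer_needed are hypothetical names, all times are in seconds, and serial numbers are compared for simple inequality.

```python
def next_action(now, last_refresh, last_attempt, last_attempt_ok,
                refresh, retry, expire):
    """Decide what a slave should do at time `now` (all values in seconds)."""
    if now - last_refresh >= expire:
        return "expire"      # zone data too old: stop answering for the zone
    # After a failed contact, the (usually shorter) retry interval applies.
    wait = refresh if last_attempt_ok else retry
    if now - last_attempt >= wait:
        return "poll"        # ask the master for its SOA and compare serials
    return "wait"

def transfer_needed(slave_serial, master_serial):
    # Simplification: any difference triggers a transfer. Real servers
    # compare serial numbers with wraparound-aware arithmetic.
    return slave_serial != master_serial

# Timers taken from the example zone: refresh 3 hours, retry 15 s, expire 7 days.
print(next_action(10800, 0, 0, True, 10800, 15, 604800))        # poll
print(next_action(10815, 0, 10800, False, 10800, 15, 604800))   # poll
print(next_action(604800, 0, 604790, False, 10800, 15, 604800)) # expire
```

Keeping the retry interval separate from the refresh interval lets a slave recover quickly from a brief outage at the master without polling aggressively in the normal case.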

An All Zone Transfer (AXFR) DNS message (a standard query containing
type AXFR in the Question section) is used to request a complete zone transfer
using TCP. To see such a message, we may arrange for a request to be initiated
using the host program in our local network:

Linux% host -l home.
Using domain server:
Name: 10.0.0.1
Address: 10.0.0.1#53
Aliases:

home name server gw.home.
ap.home has address 10.0.0.6
gw.home has address 10.0.0.1


The -l flag asks the host program to perform a full zone transfer from a local
DNS server. The program initiates a TCP-based query/response dialogue,
illustrated in Figure 11-19.


[Figure 11-19 screenshot: Wireshark window for the capture dns-axfr.tr]

No.  Time      Source      Destination  Protocol  Info
1    0.000000  10.0.0.120  10.0.0.1     TCP       49173 > 53 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=3
2    0.000194  10.0.0.1    10.0.0.120   TCP       53 > 49173 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0
3    0.001130  10.0.0.120  10.0.0.1     TCP       49173 > 53 [ACK] Seq=1 Ack=1 Win=524280 Len=0
4    (selected packet; its row is not legible in this copy, and it is decoded below)
5    0.001375  10.0.0.1    10.0.0.120   TCP       53 > 49173 [ACK] Seq=1 Ack=25 Win=5792 Len=0
6    0.019595  10.0.0.1    10.0.0.120   DNS       Standard query response SOA gw.home NS gw.home A 10.0.0.6
7    0.021133  10.0.0.120  10.0.0.1     TCP       49173 > 53 [ACK] Seq=25 Ack=300 Win=524280 Len=0
8    0.021609  10.0.0.120  10.0.0.1     TCP       49173 > 53 [FIN, ACK] Seq=25 Ack=300 Win=524280 Len=0
9    0.023144  10.0.0.1    10.0.0.120   TCP       53 > 49173 [FIN, ACK] Seq=300 Ack=26 Win=5792 Len=0
10   0.024645  10.0.0.120  10.0.0.1     TCP       49173 > 53 [ACK] Seq=26 Ack=301 Win=524280 Len=0

Frame 4: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
Ethernet II, Src: 00:17:f2:e7:6d:91, Dst: 00:04:5a:9f:9e:80
Internet Protocol, Src: 10.0.0.120, Dst: 10.0.0.1
Transmission Control Protocol, Src Port: 49173, Dst Port: 53, Seq: 1, Ack: 1, Len: 24
Domain Name System (query)
    [Response In: 6]
    Length: 22
    Transaction ID: 0xee03
    Flags: 0x0000 (Standard query)
    Questions: 1
    Answer RRs: 0
    Authority RRs: 0
    Additional RRs: 0
    Queries
        home: type AXFR, class IN
            Name: home
            Type: AXFR (Request for full zone transfer)
            Class: IN (0x0001)


Figure 11-19 A DNS request for a full zone transfer uses the AXFR record type and TCP as a
transport protocol.


In Figure 11-19 we can see how the zone transfer takes place using TCP. The
first three TCP segments are part of the standard TCP connection establishment
process (see Chapter 13). The fourth (decoded) packet is the request. It is a
normal DNS standard query, with type AXFR and class IN (Internet). The query is
for the domain name home. The response to this query is contained in message 6,
following the TCP ACK (see Figure 11-20).

[Figure 11-20 screenshot: the same Wireshark capture, with packet 6 selected]

Frame 6: 365 bytes on wire (2920 bits), 365 bytes captured (2920 bits)
Ethernet II, Src: 00:04:5a:9f:9e:80, Dst: 00:17:f2:e7:6d:91
Internet Protocol, Src: 10.0.0.1, Dst: 10.0.0.120
Transmission Control Protocol, Src Port: 53, Dst Port: 49173, Seq: 1, Ack: 25
Domain Name System (response)
    [Request In: 4]
    [Time: 0.018346000 seconds]
    Length: 297
    Transaction ID: 0xee03
    Flags: 0x8480 (Standard query response, No error)
    Questions: 1
    Answer RRs: 12
    Authority RRs: 0
    Additional RRs: 0
    Queries
        home: type AXFR, class IN
            Name: home
            Type: AXFR (Request for full zone transfer)
            Class: IN (0x0001)
    Answers
        home: type SOA, class IN, mname gw.home
        home: type NS, class IN, ns gw.home
        ap.home: type A, class IN, addr 10.0.0.6
        dns.home: type CNAME, class IN, cname gw.home
        dsl.home: type A, class IN, addr 10.0.0.138
        extern.home: type A, class IN, addr 10.0.0.129
        fax.home: type CNAME, class IN, cname hp.home
        fc8.home: type A, class IN, addr 10.0.0.13
        gw.home: type A, class IN, addr 10.0.0.1
        hp.home: type A, class IN, addr 10.0.0.14
        scanner.home: type CNAME, class IN, cname hp.home
        home: type SOA, class IN, mname gw.home


Figure 11-20 The successful response for a full zone transfer request includes all of the records for
the zone. The transaction takes place using TCP, as the zone contents may be large and
a reliable copy is required.


In Figure 11-20 we can see how the entire zone is carried in the response. After
receiving the response, the client's TCP ACKs the data and initiates a TCP
connection close. The connection is closed gracefully using the FIN-ACK handshake
(packets 8-10). See Chapter 13 for more details on the standard TCP connection
establishment and clearing.
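
The request side of this exchange can be reproduced with a short Python sketch. It is an illustration only: encode_name and build_axfr_query are hypothetical helper names, and the code builds the bytes without sending them. The result agrees with the sizes in Figure 11-19, where the DNS message is 22 bytes and the TCP payload, including the two-byte length field used for DNS over TCP, is 24 bytes.

```python
import struct

AXFR, IN = 252, 1   # QTYPE for a full zone transfer, class Internet

def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

def build_axfr_query(zone: str, txid: int) -> bytes:
    # Header: ID, flags 0x0000 (standard query), 1 question, no other records.
    header = struct.pack("!6H", txid, 0x0000, 1, 0, 0, 0)
    question = encode_name(zone) + struct.pack("!2H", AXFR, IN)
    message = header + question
    # Over TCP, each DNS message is preceded by a two-byte length field.
    return struct.pack("!H", len(message)) + message

framed = build_axfr_query("home.", 0xEE03)
print(len(framed) - 2, len(framed))   # 22 24
```

Writing these bytes to a TCP connection on port 53 of a server that permits transfers would be expected to produce a response like the one in Figure 11-20, although whether a given server cooperates depends on its access control configuration.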

Although it used to be possible to perform such zone transfers with virtually
any DNS server, they are now typically restricted to the authoritative servers in a
zone (e.g., those listed in NS records for the zone). The reason for this restriction
is privacy and security: knowledge of the hosts within the zone might help an
attacker target particular services or hosts.

11.5.8.2 Incremental Zone Transfers (IXFR Messages) 

To improve the efficiency of zone transfers, [RFC1995] defines the use of
incremental zone transfers. Using incremental zone transfers and the IXFR message type,
only the changes in a zone are provided. To execute an incremental zone transfer,
the client (e.g., slave server) must provide its current serial number for the zone. In
the following example, we can emulate a requesting server by providing the serial
number and using the dig program:

Linux% dig +short @10.0.0.1 -t ixfr=1997022700 home.
gw.home. hostmaster.gw.home. 1997022700 10800 15 604800 10800

The command line indicates that output from the command should be short,
10.0.0.1 is the address of the DNS server to use, and an incremental zone transfer
starting with serial number 1997022700 should be performed. This example
creates an exchange similar to the one illustrated in Figures 11-19 and 11-20 for AXFR,
except in this case the serial number of the request matches the current serial
number (see Figure 11-21).


[Figure 11-21 screenshot: Wireshark window for the capture dns-ixfr.tr]

No.  Time      Source      Destination  Protocol  Info
1    0.000000  10.0.0.120  10.0.0.1     TCP       49193 > 53 [SYN] Seq=0 Win=65535 Len=0
2    0.000149  10.0.0.1    10.0.0.120   TCP       53 > 49193 [SYN, ACK] Seq=0 Ack=1
3    0.001089  10.0.0.120  10.0.0.1     TCP       49193 > 53 [ACK] Seq=1 Ack=1 Win=524280
4    0.001559  10.0.0.120  10.0.0.1     DNS       Standard query IXFR home
5    0.001642  10.0.0.1    10.0.0.120   TCP       53 > 49193 [ACK] Seq=1 Ack=59 Win=5792
6    0.002563  10.0.0.1    10.0.0.120   DNS       Standard query response SOA gw.home
7    0.003636  10.0.0.120  10.0.0.1     TCP       49193 > 53 [ACK] Seq=59 Ack=75 Win=524280
8    0.007293  10.0.0.120  10.0.0.1     TCP       49193 > 53 [FIN, ACK] Seq=59 Ack=75
9    0.007613  10.0.0.1    10.0.0.120   TCP       53 > 49193 [FIN, ACK] Seq=75 Ack=60
10   0.008556  10.0.0.120  10.0.0.1     TCP       49193 > 53 [ACK] Seq=60 Ack=76 Win=524280

Frame 4: 124 bytes on wire (992 bits), 124 bytes captured (992 bits)
Ethernet II, Src: 00:17:f2:e7:6d:91, Dst: 00:04:5a:9f:9e:80
Internet Protocol, Src: 10.0.0.120, Dst: 10.0.0.1
Transmission Control Protocol, Src Port: 49193, Dst Port: 53, Seq: 1
Domain Name System (query)
    [Response In: 6]
    Length: 56
    Transaction ID: 0xf390
    Flags: 0x0000 (Standard query)
    Questions: 1
    Answer RRs: 0
    Authority RRs: 1
    Additional RRs: 0
    Queries
        home: type IXFR, class IN
            Name: home
            Type: IXFR (Request for incremental zone transfer)
            Class: IN (0x0001)
    Authoritative nameservers
        home: type SOA, class IN, mname <Root>
            Name: home
            Type: SOA (Start of zone of authority)
            Class: IN (0x0001)
            Time to live: 0 time
            Data length: 22
            Primary name server: <Root>
            Responsible authority's mailbox: <Root>
            Serial number: 1997022700
            Refresh interval: 0 time
            Retry interval: 0 time
            Expiration limit: 0 time
            Minimum TTL: 0 time


Figure 11-21 An incremental zone transfer request (IXFR record type) carried on TCP. The serial
number is used to determine which records, if any, have changed since an earlier zone
transfer took place.

Figure 11-21 shows how the IXFR request includes a mostly empty SOA RR
in the authority section. The SOA record includes the serial number specified
(1997022700). The response (packet 6) contains no real information because this
serial number matches the current one at the server.
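
The layout of such a request can be sketched in Python. This is an illustration and not the code dig uses: build_ixfr_query and encode_name are hypothetical names. The use of a two-byte compression pointer (0xC00C) for the SOA owner name is an inference from the 56-byte message length shown in the trace and is not stated in the text.

```python
import struct

IXFR, SOA, IN = 251, 6, 1

def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

def build_ixfr_query(zone: str, txid: int, serial: int) -> bytes:
    # One question and one authority record; no answers or additional records.
    header = struct.pack("!6H", txid, 0x0000, 1, 0, 1, 0)
    question = encode_name(zone) + struct.pack("!2H", IXFR, IN)
    # Mostly empty SOA RDATA: root MNAME, root RNAME, then serial, refresh,
    # retry, expire, and minimum, with only the serial filled in.
    rdata = b"\x00" + b"\x00" + struct.pack("!5I", serial, 0, 0, 0, 0)
    owner = b"\xc0\x0c"   # pointer to offset 12, where the question name starts
    authority = owner + struct.pack("!2HIH", SOA, IN, 0, len(rdata)) + rdata
    return header + question + authority

msg = build_ixfr_query("home.", 0xF390, 1997022700)
print(len(msg))   # 56
```

The 22-byte RDATA matches the "Data length: 22" field in the decoded request. As with AXFR, a two-byte length field would precede the message when it is carried on TCP.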


[Figure 11-22 screenshot: Wireshark window for the capture dns-ixfr.tr. The packet
list is the same as in Figure 11-21; packet 6, the response, is selected.]

Frame 6: 140 bytes on wire (1120 bits), 140 bytes captured (1120 bits)
Ethernet II, Src: 00:04:5a:9f:9e:80, Dst: 00:17:f2:e7:6d:91
Internet Protocol, Src: 10.0.0.1, Dst: 10.0.0.120
Transmission Control Protocol, Src Port: 53, Dst Port: 49193, Seq: 1
Domain Name System (response)
    [Request In: 4]
    [Time: 0.001004000 seconds]
    Length: 72
    Transaction ID: 0xf390
    Flags: 0x8480 (Standard query response, No error)
    Questions: 1
    Answer RRs: 1
    Authority RRs: 0
    Additional RRs: 0
    Queries
        home: type IXFR, class IN
            Name: home
            Type: IXFR (Request for incremental zone transfer)
            Class: IN (0x0001)
    Answers
        home: type SOA, class IN, mname gw.home
            Name: home
            Type: SOA (Start of zone of authority)
            Class: IN (0x0001)
            Time to live: 1 day
            Data length: 38
            Primary name server: gw.home
            Responsible authority's mailbox: hostmaster.gw.home
            Serial number: 1997022700
            Refresh interval: 3 hours
            Retry interval: 15 seconds
            Expiration limit: 7 days
            Minimum TTL: 3 hours


Figure 11-22 The response to an IXFR request when the serial number is current contains only an
SOA record and no additional information.


The response in Figure 11-22 contains only the SOA RR in the answer section.
Unlike the one contained in the query, this one is filled in with the complete
SOA fields (e.g., mailbox, zone transfer parameters). However, there are no
additional answers because the current serial number for the zone matches that of the
request. Thus, the requesting client is assumed to be up-to-date and not in need of
any additional information or a zone transfer.

11.5.8.3 DNS NOTIFY

As mentioned previously, polling has traditionally been used to determine the
need for zone transfers, meaning that the slave servers would check with a master
periodically (the "refresh" interval) to see if the zone had been updated (indicated
by a different serial number), in which case a zone transfer would be initiated.
This is a somewhat wasteful process because many useless polls may occur before
the zone is updated. To improve the situation, [RFC1996] developed the DNS
NOTIFY mechanism. DNS NOTIFY allows a server with modified zone contents
to notify slave servers that an update has been made and a zone transfer should
be initiated. More specifically, if enabled, a notification message is sent to a set
of interested servers if the SOA RR for a zone changes (e.g., if the serial number
increases). This allows zone transfers to be initiated easily when required. Using a
local (home) name server, we can see how this works (see Figure 11-23).





[Figure 11-23 screenshot: Wireshark window showing three DNS NOTIFY messages]

No.  Time       Source    Destination  Protocol  Info
1    (selected packet; its row is not legible in this copy, and it is decoded below)
2    15.019645  10.0.0.1  10.0.0.6     DNS       Zone change notification SOA home
3    30.020745  10.0.0.1  10.0.0.6     DNS       Zone change notification SOA home

Frame 1: 114 bytes on wire (912 bits), 114 bytes captured (912 bits)
Ethernet II, Src: 00:04:5a:9f:9e:80, Dst: 00:02:6f:2e:19:4c
Internet Protocol, Src: 10.0.0.1, Dst: 10.0.0.6
User Datagram Protocol, Src Port: 2632, Dst Port: 53
Domain Name System (query)
    Transaction ID: 0x4436
    Flags: 0x2400 (Zone change notification)
        Response: Message is a query
        Opcode: Zone change notification (4)
        Truncated: Message is not truncated
        Recursion desired: Don't do query recursively
        Z: reserved (0)
        Non-authenticated data: Unacceptable
    Questions: 1
    Answer RRs: 1
    Authority RRs: 0
    Additional RRs: 0
    Queries
        home: type SOA, class IN
            Name: home
            Type: SOA (Start of zone of authority)
            Class: IN (0x0001)
    Answers
        home: type SOA, class IN, mname gw.home
            Name: home
            Type: SOA (Start of zone of authority)
            Class: IN (0x0001)
            Time to live: 0 time
            Data length: 38
            Primary name server: gw.home
            Responsible authority's mailbox: hostmaster.gw.home
            Serial number: 1997022701
            Refresh interval: 3 hours
            Retry interval: 15 seconds
            Expiration limit: 7 days
            Minimum TTL: 3 hours


Figure 11-23 A DNS NOTIFY indicating an update to the zone file. There are two retransmissions
spaced 15s apart (contrary to the method suggested in the standard).

This example illustrates the simple DNS NOTIFY message sent to a host in the
server's notify set, the set of servers that should be informed of a zone change. The
message is a UDP/IPv4 DNS query message with the Flags field indicating a zone change
notification. The query section contains the type and class for an SOA record, and
the answer section contains the current SOA RR for the zone (with TTL 0),
including the serial number. This provides sufficient information for a notified server
to determine that a zone transfer may be necessary. Note that a single server may
receive notifications from multiple other servers as they update their zone
information. This does not present a problem for the protocol's operation.

The DNS NOTIFY mechanism defaults to using UDP, an unreliable protocol.
In this particular example, the notify set contains only the address 10.0.0.6 (the
destination shown in Figure 11-23), which does not run a DNS server. Consequently,
the message is resent every 15s hoping for a response that never arrives.
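
The header and question of a NOTIFY message like the one in Figure 11-23 can be assembled as follows. This is a minimal sketch with hypothetical helper names (encode_name, build_notify). It omits the answer-section SOA that the captured message carries, so the resulting message is shorter than the one in the trace.

```python
import struct

SOA, IN = 6, 1
OPCODE_NOTIFY = 4
AA = 0x0400   # Authoritative Answer bit

def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

def build_notify(zone: str, txid: int) -> bytes:
    # QR=0 (a request), opcode 4, AA set: together these give flags 0x2400.
    flags = (OPCODE_NOTIFY << 11) | AA
    header = struct.pack("!6H", txid, flags, 1, 0, 0, 0)
    question = encode_name(zone) + struct.pack("!2H", SOA, IN)
    return header + question

msg = build_notify("home.", 0x4436)
print(hex(struct.unpack("!H", msg[2:4])[0]))   # 0x2400
```

The computed flags value matches the 0x2400 shown in the decoded frame, which suggests the AA bit is set in the captured message even though the decode does not list it on a separate line.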


Note 

The time between retransmissions and the total number of retransmissions to
attempt are suggested by [RFC1996] to be 60s and five retransmissions,
respectively. It also suggests that a timer backoff method (additive or exponential) be
used. Here we can see that the BIND9 implementation fails to respect these
suggestions, as the two retransmissions are 15s apart.


Responses are simply DNS response messages with no useful information 
except the transaction ID; they are used only to complete the protocol and cancel 
retransmissions at the sending server. 
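
The difference between the suggested retransmission behavior and the fixed 15s spacing seen in the trace can be made concrete with a small schedule generator. This is a sketch with assumptions: notify_schedule is a hypothetical name, and the doubling factor and the additive step (one initial interval) are illustrative choices, since the text says only that the backoff may be additive or exponential.

```python
def notify_schedule(initial=60.0, retries=5, backoff="exponential"):
    """Return the delay in seconds before each retransmission."""
    delays = []
    delay = initial
    for _ in range(retries):
        delays.append(delay)
        if backoff == "exponential":
            delay *= 2           # assumed factor
        elif backoff == "additive":
            delay += initial     # assumed step
        # any other value keeps the spacing constant
    return delays

print(notify_schedule())                  # [60.0, 120.0, 240.0, 480.0, 960.0]
print(notify_schedule(15.0, 2, "none"))   # [15.0, 15.0]
```

The second call corresponds to the behavior observed in Figure 11-23. A sender would cancel the remaining entries as soon as a response with the matching transaction ID arrives.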


11.6 Sort Lists, Round-Robin, and Split DNS

So far we have discussed how domain names are set up, the types of resource
records DNS supports, and the DNS protocol used to fetch and update a zone. One
subtle point to consider is what data is returned and in what order in response
to a DNS query. A DNS server could return all matching data to any client in
whatever order the server finds most convenient. However, special configuration
options and behaviors are available in most DNS server software to achieve
certain operational, privacy, or performance goals. Consider the topology shown in
Figure 11-24.

The type of topology shown in Figure 11-24 is typical of a small enterprise.
There is a private network and a public network including a DNS server. In
addition, there is a pair of hosts on the DMZ (A and B), one on the internal network (C)
and one on the Internet (R). A multihomed host (M) spans the DMZ and internal
networks. M therefore has two IP addresses drawn from two different network
prefixes.






Figure 11-24 In a small enterprise topology, DNS may be configured to return different addresses 
depending on the requesting IP address. 


A host wishing to contact M performs a DNS lookup that returns two
addresses: one associated with the internal network and one with the DMZ.
Naturally, it would be more efficient if A, B, and R reached M via the DMZ and
C reached M via the internal network. This generally happens if the DNS server
orders its returned address records based on the source IP address of the request.
(It could also use the destination IP address, especially if M uses multiple IP
addresses from different subnets on the same network interface.) If the
requesting system uses a source IP address with the same network prefix as the address
in a returned address record, the DNS server places the set of such matching
records early in the returned message. This behavior encourages the client to find
the "closest" IP address for a particular server it is attempting to contact, because
most simple applications attempt to contact the first address found among the
returned address records. The precise behavior can usually be controlled using
a so-called sortlist or rrset-order directive (options used in configuration
files for resolvers and servers). Such sorting behavior may also happen
automatically if performed by the DNS server software by default.
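
The ordering rule can be illustrated with Python's ipaddress module. This sketch is not the algorithm of any specific server: sort_addresses is a hypothetical name, and the prefixes and addresses for M, C, and R are invented for the example because the text does not give numeric values for the topology.

```python
import ipaddress

def sort_addresses(client_ip, addresses, prefixes):
    """Order addresses so those sharing a prefix with the client come first."""
    client = ipaddress.ip_address(client_ip)
    # Configured networks that contain the client's source address.
    local = [ipaddress.ip_network(p) for p in prefixes
             if client in ipaddress.ip_network(p)]

    def is_far(addr):
        a = ipaddress.ip_address(addr)
        return not any(a in net for net in local)

    # sorted() is stable, so the original order is kept within each group.
    return sorted(addresses, key=is_far)

PREFIXES = ["192.0.2.0/24", "10.0.0.0/24"]   # assumed DMZ and internal prefixes
M_ADDRS = ["192.0.2.10", "10.0.0.10"]        # assumed addresses of host M

print(sort_addresses("10.0.0.7", M_ADDRS, PREFIXES))     # ['10.0.0.10', '192.0.2.10']
print(sort_addresses("203.0.113.9", M_ADDRS, PREFIXES))  # ['192.0.2.10', '10.0.0.10']
```

An internal client such as C sees the internal address first, while an outside client such as R matches no configured prefix and receives the records in their original order.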

A somewhat related situation arises when one service is offered using more
than one server such that the incoming connections are load-balanced (i.e., divided
among the servers). In the preceding example, imagine that a service is offered on
both A and B. Such a service may be identified by the URL http://www.example.com.
Requesting clients (like R) perform a DNS query on the domain name
www.example.com, and the DNS server eventually returns a set of address
records. To achieve load balancing, the DNS server may be configured to use
DNS round-robin, which means that the server permutes the order of the returned
address records. Doing so encourages each new client to access the service on a
different server from the previous client. While this helps to balance load, it is far
from perfect. When records are cached, the desired effect may not occur because
of reuse of existing cached address records. In addition, this scheme may
balance the number of connections well across servers, but not the load. Different
connections can have radically different processing requirements, so the true
processing load is likely to remain unbalanced unless the particular service always
has the same processing requirements.
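
One simple way to permute the returned records is to rotate the list by one position per query. The sketch below assumes that policy. RoundRobin is a hypothetical class name, and the two addresses stand in for hosts A and B.

```python
from collections import deque

class RoundRobin:
    """Rotate the address list by one position for each query answered."""

    def __init__(self, addresses):
        self._ring = deque(addresses)

    def answer(self):
        current = list(self._ring)
        self._ring.rotate(-1)    # the next query starts with the next address
        return current

rr = RoundRobin(["192.0.2.1", "192.0.2.2"])   # assumed addresses for A and B
print(rr.answer())   # ['192.0.2.1', '192.0.2.2']
print(rr.answer())   # ['192.0.2.2', '192.0.2.1']
print(rr.answer())   # ['192.0.2.1', '192.0.2.2']
```

Because a caching resolver may reuse one of these answers for many clients until its TTL expires, the rotation affects only queries that actually reach the server, which is the caching limitation noted above.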

A final consideration regarding the data returned by a DNS server is
support for privacy. In this example, we may wish to arrange for hosts within the
enterprise to be able to retrieve resource records for every computer in the
network, while we limit the set of systems that remain visible to R. A technique for
implementing this goal is called split DNS. In split DNS, the set of resource records
returned in response to a query is dependent on the identity of the client and
possibly query destination address. Most often, the client is identified by IP address
or address prefix. With split DNS, we could arrange for any host in the enterprise
(i.e., those sharing a set of prefixes) to be provided with the entire DNS database,
whereas those outside are given visibility only to A and B, where the main Web
service is offered.
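
The view selection at the heart of split DNS can be sketched as a lookup keyed on the client's address. Everything concrete here is an assumption made for illustration: the prefixes, the host names under example.com, the addresses, and the function name lookup.

```python
import ipaddress

# Assumed enterprise prefixes (DMZ and internal network).
ENTERPRISE_PREFIXES = [ipaddress.ip_network("192.0.2.0/24"),
                       ipaddress.ip_network("10.0.0.0/24")]

# Assumed zone data: insiders see every host, outsiders see only A and B.
INTERNAL_VIEW = {"a.example.com": "192.0.2.1", "b.example.com": "192.0.2.2",
                 "c.example.com": "10.0.0.3", "m.example.com": "10.0.0.10"}
EXTERNAL_VIEW = {"a.example.com": "192.0.2.1", "b.example.com": "192.0.2.2"}

def lookup(client_ip, name):
    """Answer from the view that matches the client's source address."""
    client = ipaddress.ip_address(client_ip)
    inside = any(client in net for net in ENTERPRISE_PREFIXES)
    view = INTERNAL_VIEW if inside else EXTERNAL_VIEW
    return view.get(name)    # None stands in for a "no such name" answer

print(lookup("10.0.0.7", "c.example.com"))      # 10.0.0.3
print(lookup("203.0.113.9", "c.example.com"))   # None
print(lookup("203.0.113.9", "a.example.com"))   # 192.0.2.1
```

Identifying clients by source address is simple but only as trustworthy as the network path, so the selection is typically made at a server that sees the client's real address.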


11.7 Open DNS Servers and DynDNS 

Many home users are assigned a single IPv4 address by their ISP, and this address
may change over time as the user's computer or home gateway connects,
disconnects, and reconnects to the Internet. Consequently, it is often difficult for the user
to establish a DNS entry that allows for running services that are visible from
the Internet. A number of so-called open Dynamic DNS (DDNS) servers are
available that support a special update protocol called the DNS Update API [DYNDNS],
whereby a user may update an entry in a provider's DNS server given a
preregistration or account. This scheme does not use the [RFC2136] DNS UPDATE protocol
described earlier but is instead a separate application-layer protocol.

To use the service, a DDNS client program (e.g., inadyn or ddclient on
Linux and DynDNS Updater for Windows) runs on the client system, which could
also be a user's home router. Most often, these programs are configured with login
information used to access a remote DDNS service. When the service is invoked,
the client program contacts the server, provides the current global IP address of its
host (the one assigned by an ISP, often a NAT mapped address), and goes quiescent.
After that, it periodically renews the information with the server. Doing so allows
the server to clear the information if an update is not received within a certain
time interval. Such services include those provided at the following Web sites (as of
2011): http://www.dyndns.com/services/dns/dyndns, http://freedns.afraid.org,
and http://www.no-ip.com/services/managed_dns/free_dynamic_dns.html.


11.8 Transparency and Extensibility 

The DNS is one of the most ubiquitous services on the Internet and has been
an attractive service to consider as a basis for adding new capabilities through
extensions. There are, for example, numerous record types such as TXT, SRV, and
even A (e.g., see [RFC5782]) that could be used for encoding data useful for
various future services. [RFC5507] considers various methods for extending the DNS,
ultimately concluding that creation and implementation of new RR types is the
most attractive approach. Thanks to an earlier specification [RFC3597], there is a
standard method for handling unknown RR types as opaque data. That is, they
are not interpreted unless recognized; the processing is transparent. This allows
for new RR types to be carried along without causing negative impact on the
processing of existing RR types.

One complication with preserving transparency is the encoding of embedded
domain names and compression. For known RR types, embedded domain
names are permitted to have their cases altered in order to achieve compression
with compression labels. Owner domain names (the "keys" of queries) are always
subject to compression. For unknown RR types, however, embedded domain
names are not permitted to use compression labels. In addition, future RR types
that contain embedded domain names are likewise prohibited from using
compression labels for those names (see Section 4 of [RFC3597]). Unknown types can
still be compared (e.g., for dynamic updates) in a bitwise fashion. This implies that
any embedded domain names are compared in a case-sensitive manner [RFC4343],
contrary to most other DNS operations. This same situation appears for embedded
domain names used with TXT records.
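
The consequence for comparisons can be shown with a deliberately narrow sketch. It assumes RDATA that consists of exactly one uncompressed domain name, and rdata_matches is a hypothetical function name. Real RDATA formats vary by type, so a full implementation would need per-type parsing for the known types.

```python
def rdata_matches(known_name_type: bool, a: bytes, b: bytes) -> bool:
    """Compare RDATA that consists of a single uncompressed domain name."""
    if known_name_type:
        # Known type (e.g., NS or CNAME): names match regardless of ASCII
        # case. Label length octets are at most 63, so lower() leaves them
        # untouched and changes only the letters A-Z.
        return a.lower() == b.lower()
    # Unknown type: the RDATA is opaque and is compared bit for bit.
    return a == b

lower = b"\x02gw\x04home\x00"
upper = b"\x02GW\x04HOME\x00"
print(rdata_matches(True, lower, upper))    # True
print(rdata_matches(False, lower, upper))   # False
```

The same pair of names therefore matches under a known type but not under an unknown one, which is the case-sensitivity effect described above.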

A different issue arises regarding transparency when new forms of servers 
and proxies are introduced that process DNS traffic. It is now relatively common 
practice to include a DNS proxy colocated inside a home gateway or firewall. A 
typical proxy handles incoming DNS requests from a user's home network and 
forwards the request to an ISP-provided name server. It also receives returned 
information and may or may not cache the results. Historically, some proxies have 
tried to do more than merely relay requests and replies, and this has caused some 
problems with DNS interoperability. [RFC5625] specifies the proper operation 
of a DNS proxy, essentially requiring DNS RRs to be uninterpreted and merely 
relayed by the proxy. In cases where packet truncation cannot be avoided, any 
such proxy must set the TC bit field to indicate that some DNS data was removed. 
Furthermore, any such proxies should be prepared to handle TCP requests, as this 
is the conventional fallback mechanism when a previous UDP-based request was 
truncated and is required by [RFC5966]. 
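
A proxy's truncation handling can be sketched as follows. This is one possible approach and not the procedure mandated by [RFC5625]: when a response exceeds the UDP limit, the sketch keeps only the header and question, zeroes the record counts, and sets TC so that the client retries over TCP. The name truncate_for_udp, the 512-byte default limit, and the assumption that question names are uncompressed are all choices made for this example.

```python
import struct

TC = 0x0200   # truncation bit in the DNS header flags

def truncate_for_udp(message: bytes, limit: int = 512) -> bytes:
    """Relay a response unchanged if it fits; otherwise truncate and set TC."""
    if len(message) <= limit:
        return message                       # relayed uninterpreted
    txid, flags, qdcount, _an, _ns, _ar = struct.unpack("!6H", message[:12])
    # Walk the question section; names here are assumed to be uncompressed.
    pos = 12
    for _ in range(qdcount):
        while message[pos] != 0:
            pos += 1 + message[pos]          # skip one label
        pos += 1 + 4                         # zero byte, QTYPE, QCLASS
    header = struct.pack("!6H", txid, flags | TC, qdcount, 0, 0, 0)
    return header + message[12:pos]

question = b"\x04home\x00" + struct.pack("!2H", 1, 1)
big = struct.pack("!6H", 1, 0x8180, 1, 1, 0, 0) + question + b"\x00" * 600
out = truncate_for_udp(big)
print(len(out), bool(struct.unpack("!H", out[2:4])[0] & TC))   # 22 True
```

Dropping whole sections is cruder than trimming at resource record boundaries, but it never leaves a partially relayed record in the message.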


11.9 Translating DNS from IPv4 to IPv6 (DNS64) 

In Chapter 7 we described a framework for translating IP datagrams back and
forth between IPv4 and IPv6. Translators supporting such capabilities are
envisioned to be deployed with a related capability that translates between DNS A
and AAAA records [RFC6147], allowing IPv6-only clients to access DNS
information that appears in A records (e.g., in the IPv4 Internet). The capability is called
DNS64, and one of its proposed deployment scenarios (called "DNS64 in DNS
recursive-resolver mode") is illustrated in Figure 11-25.

Figure 11-25 DNS64 translates A records to AAAA records and works together with an IPv4/IPv6 translator
to allow IPv6-only clients to access services in IPv4 networks.


As shown in Figure 11-25, DNS64 is used in conjunction with an IPv4/IPv6 
translator (see Chapter 7). Each device is configured with one or more common 
IPv6 prefixes used in creating IPv4-embedded addresses. Each prefix may be a 
Network-Specific Prefix (e.g., that is owned by an operator) or the Well-Known 
Prefix (64:ff9b::/96). The DNS64 device acts as a caching DNS server. IPv6-only 
clients use it as the primary DNS server and are able to request AAAA records for 
domain names. DNS64 converts such requests to requests for both A and AAAA 
records on its IPv4 side. If no AAAA records are returned, DNS64 provides
synthetic AAAA records by forming an IPv4-embedded address based on the
configured prefix and the contents of each A record it retrieves. DNS64 also responds to
PTR queries for any of the IPv6 prefixes it uses for synthesizing AAAA RRs. 

To implement AAAA RR synthesis in a DNS64 device, only the answer
section of a DNS message is effectively altered. Other sections remain as they appear
when retrieved on the IPv4 side. In cases of CNAME or DNAME chains, the chain
is followed recursively until an A or AAAA record is found and the elements of
the chain are included in the response. In addition, DNS64 may be configured so
as to avoid synthesis for particular excluded IPv6 or IPv4 address ranges. This
prevents certain anomalous behavior (e.g., forming IPv4-embedded addresses
based on special-use IPv4 addresses). Note that DNS64 has subtle interactions
with DNSSEC; these issues are covered in Chapter 18.
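
The address synthesis step can be sketched with Python's ipaddress module. This is a minimal illustration restricted to /96 prefixes such as the Well-Known Prefix, where the IPv4 address occupies the low 32 bits. The names synthesize_aaaa and answer_aaaa are hypothetical, and the exclusion ranges mentioned above are not modeled.

```python
import ipaddress

WELL_KNOWN_PREFIX = ipaddress.ip_network("64:ff9b::/96")

def synthesize_aaaa(a_record: str, prefix=WELL_KNOWN_PREFIX) -> str:
    """Embed an IPv4 address in the low 32 bits of a /96 IPv6 prefix."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch handles only /96 prefixes")
    v4 = ipaddress.IPv4Address(a_record)
    return str(ipaddress.IPv6Address(int(prefix.network_address) | int(v4)))

def answer_aaaa(aaaa_records, a_records):
    """Return real AAAA records if any exist, otherwise synthesized ones."""
    if aaaa_records:
        return list(aaaa_records)
    return [synthesize_aaaa(a) for a in a_records]

print(synthesize_aaaa("192.0.2.33"))          # 64:ff9b::c000:221
print(answer_aaaa([], ["192.0.2.33"]))        # ['64:ff9b::c000:221']
```

Synthesis is skipped whenever genuine AAAA records are available, so hosts that already have IPv6 addresses are reached natively and not through the translator.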


11.10 LLMNR and mDNS 

The ordinary DNS system requires a set of DNS servers to be configured to provide
mappings between names and addresses, and possibly other information.
Sometimes this is too much overhead when only a few local hosts wish to communicate.
In cases where a DNS server is not available (e.g., a quickly formed ad hoc network
of clients that connect only to each other), a special local version of DNS called
Link-Local Multicast Name Resolution (LLMNR) [RFC4795] may be available. It is a
(nonstandard) protocol based on DNS developed by Microsoft and used in local
environments to help discover devices on a local area network, such as printers
and file servers. It is supported in Windows Vista, Server 2008, and 7. It uses UDP
port 5355 with the IPv4 multicast address 224.0.0.252 and IPv6 address ff02::1:3.
The servers also use TCP on port 5355 from whatever unicast IP address they
respond from.

Multicast DNS (mDNS) [IDMDNS] is another form of local DNS-like
capability developed by Apple. When it is combined with the DNS Service Discovery
protocol, Apple calls the resulting framework Bonjour. mDNS uses DNS messages
carried over local multicast addresses. It uses UDP with port 5353. It specifies
that the special TLD .local is to be treated with special semantics. The .local
TLD is link-local in scope. Any DNS queries for domain names in this TLD are
sent to the mDNS IPv4 address 224.0.0.251 or the IPv6 address ff02::fb. Queries
for other domains may optionally be sent to these multicast addresses.
Allowing link-local servers to respond to mappings for global names can raise
significant security concerns. To combat this problem, DNSSEC can be employed (see
Chapter 18). mDNS supports autonomous assignment of names in the .local
pseudo-TLD, although this pseudo-TLD has not been officially reserved for this
purpose [RFC2606]. Thus, hosts on small networks such as home LANs can be
assigned convenient names such as printer.local, fileserver.local,
camera1.local, kevinlaptop.local, and the like. A mechanism in mDNS is used
to detect and resolve conflicts.
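
The rule for choosing where to send a query can be written down directly from the addresses and port given above. The function names are hypothetical, and the sketch covers only the mandatory case: names outside .local return None, meaning the ordinary DNS path is used.

```python
MDNS_PORT = 5353
MDNS_V4 = "224.0.0.251"
MDNS_V6 = "ff02::fb"

def is_link_local_name(name: str) -> bool:
    """True if the rightmost label of the name is 'local' (case-insensitive)."""
    labels = name.rstrip(".").lower().split(".")
    return labels[-1] == "local"

def query_destination(name: str, ipv6: bool = False):
    """Return (address, port) for a .local name, or None to use ordinary DNS."""
    if is_link_local_name(name):
        return (MDNS_V6 if ipv6 else MDNS_V4, MDNS_PORT)
    return None

print(query_destination("printer.local"))      # ('224.0.0.251', 5353)
print(query_destination("www.example.com"))    # None
```

A resolver library would apply this test before consulting its configured DNS servers, so that .local names are never sent off the local link.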


11.11 LDAP 

So far we have discussed DNS and local name services that resemble DNS. To
support richer queries and data manipulations, there is a more general directory
service we mentioned earlier called LDAP [RFC4510]. LDAP (now LDAPv3) is
an application protocol for the Internet that provides access to general
directories (e.g., "white pages") in accordance with the X.500 (1993) [X500] data and
service models. It provides the ability to search, modify, add, compare, and remove
entries based on user-selected patterns. An LDAP directory is a tree of directory
entries, where each entry consists of a set of attributes. As TCP/IP has become
more popular, LDAP has evolved from its roots to work in conjunction with DNS.
For example, a query about directory entries matching the chancellor's office at
MIT could be formed using the LDAP search tool ldapsearch (Microsoft has a
comparable tool called ldp available as a support tool from its Web site), which
works as follows:


Linux% ldapsearch -x -h ldap.mit.edu -b "dc=mit,dc=edu" \
"(ou=*Chancellor*)"
# extended LDIF
#
# LDAPv3
# base <dc=mit,dc=edu> with scope sub
# filter: (ou=*Chancellor*)
# requesting: ALL
#


The command line indicates that the server ldap.mit.edu should be
contacted without using any special authentication protocol (-x option). While a
complete discussion of LDAP is well beyond the scope of this chapter (and
book!), the partial output shows how the dc (domain component) attribute is
used to link LDAP data with the DNS. Each dc component holds one DNS label,
and together they can be used to encode an entire domain name, which is used as
the "base" portion for the LDAP query. Using this convention, it is not especially
difficult to form valid LDAP queries. In this case, it is for the organizational unit
(ou) containing the word Chancellor. Note that wildcards can be used.
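
The dc convention can be illustrated with a short sketch. The helper names
domain_to_base_dn and base_dn_to_domain below are hypothetical (they are not
part of any LDAP library), and the sketch assumes a base made only of dc
components with no characters that need escaping:

```python
def domain_to_base_dn(domain: str) -> str:
    """Encode a DNS domain name as an LDAP base DN, one dc per DNS label."""
    labels = [label for label in domain.strip(".").split(".") if label]
    return ",".join(f"dc={label}" for label in labels)


def base_dn_to_domain(base_dn: str) -> str:
    """Recover the domain name from a DN that consists only of dc components."""
    parts = [part.strip() for part in base_dn.split(",")]
    return ".".join(
        part.split("=", 1)[1] for part in parts if part.lower().startswith("dc=")
    )


print(domain_to_base_dn("mit.edu"))  # prints dc=mit,dc=edu
```

The result matches the base shown in the ldapsearch output above.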

LDAP servers are used most often within enterprises to hold directory information
such as location, telephone number, and organizational unit. Microsoft's Active
Directory product includes LDAP capabilities and is used extensively for managing
user accounts, services, and access rights in large enterprises using Windows.
Some LDAP servers (such as MIT's and those of many other universities) are also
available through the public Internet.


11.12 Attacks on the DNS 

The DNS is a critical component of the Internet and has been the object of several
attacks and countermeasures over the years [RFC3833]. Relatively recently,
a global effort called DNS Security (DNSSEC) has made substantial progress in
adding strong authentication to DNS operations. We defer the detailed discussion
of how DNSSEC works to Chapter 18, where we also cover the necessary cryptography
background. We now explore some of the attacks that have been waged
against the DNS.

There have been two main forms of attacks against the DNS. The first form
involves a DoS attack where the DNS is rendered inoperative because of overloading
of important DNS servers, such as the root or TLD servers. The second form
alters the contents of resource records or masquerades as an official DNS server
but responds with bogus resource records, thereby causing hosts to contact the
incorrect IP address when attempting to connect to another machine (e.g., a Web
site such as a bank).

The first major DoS attack on DNS took place in early 2001. The attack involved
generating many requests for the MX records of AOL.COM. The attacker generated
DNS requests for an MX record using forged source IP addresses. The request
is a relatively small packet, whereas the response is larger (by about a factor of
20), so this type of attack is called an amplification attack because the amount of
bandwidth consumed as the result of the attack is greater than the amount used
in generating the attack by a significant factor. The responses are directed at the
IP address contained in the request packets, so the attacker could essentially cause
the response traffic to be directed wherever (s)he intended. The attack is documented
in detail in a CERT incident note [CIN].

A form of attack involving modification of the data within DNS was reported
in late 2008 [CKB] and is now known as the Kaminsky Attack. It involves cache
poisoning, where the cached contents of a DNS server are replaced with erroneous
or forged data and ultimately served to the resolvers on end hosts. In one variant,
an attacker responds to a caching server's query for an A record with an NS
record for the domain using a particular host domain name. The host's IP address
(chosen by the attacker) is also provided in the additional information section of
the DNS response. The host domain name may or may not share the same subdomains
as the original DNS request. The main risk associated with this form of
attack is that clients that depend on proper DNS name-to-address resolution may
be directed to fake servers. If such servers are intentionally configured to mimic
the original host (e.g., masquerading as a bank's Web server), users may unwittingly
trust the masquerading server and divulge sensitive information. Mitigation
techniques for this and other related attacks are given by [RFC5452]. One
approach not described in [RFC5452] called DNS-0x20 [D08] involves encoding a
nonce in the 0x20 bit position of each character in the Query Name part of a question
section that is echoed back in the corresponding area of each response. This
is made possible because, although domain names are compared in a case-insensitive
way, servers tend to return an exact copy of the Query Name when forming
responses. If the case of the owner's name is intentionally mixed up in the query,
an unsolicited response will have difficulty reproducing the nonce, and can more
readily be identified (and ignored).
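
The idea behind DNS-0x20 can be sketched in a few lines. This is only an
illustration of the case-mixing nonce, not a resolver implementation; the
function names encode_0x20 and response_matches are hypothetical, and a real
resolver would combine this check with the query ID and source port checks:

```python
import random


def encode_0x20(qname: str, rng: random.Random) -> str:
    """Randomly set or clear the 0x20 bit (the letter case) of each letter."""
    return "".join(
        ch.upper() if ch.isalpha() and rng.random() < 0.5 else ch.lower()
        for ch in qname
    )


def response_matches(sent_qname: str, echoed_qname: str) -> bool:
    """Accept a response only if it echoes the exact mixed-case Query Name."""
    return sent_qname == echoed_qname


# SystemRandom draws from the operating system's entropy source, which makes
# the case pattern hard for an off-path attacker to predict.
rng = random.SystemRandom()
query = encode_0x20("www.example.com", rng)
print(query, response_matches(query, query))
```

A forged response that guesses the name in all lowercase is rejected unless the
random pattern happened to be all lowercase, and the number of possible
patterns doubles with each letter in the name.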


11.13 Summary 

The DNS is an essential part of the Internet, and DNS technology is widely used
in private networks as well. The DNS name space is worldwide in scope and is
divided into a hierarchy starting with top-level domains (TLDs). Domain names
can be represented in multiple languages and scripts using internationalized
domain names (IDNs). Applications use resolvers to contact one or more DNS
servers to perform lookup tasks against a zone database, such as converting a host
name to an IP address and vice versa. Resolvers then contact a local name server,
and this server may act recursively to contact one of the root servers or other servers
to fulfill the request. Most DNS servers, and some resolvers, cache information
learned in order to provide it to subsequent clients for some period of time called
the time to live (TTL). Queries and responses use a special DNS protocol that
works with either TCP or UDP. The protocol also works with either IPv4 or IPv6,
or any mixture of the two.

All DNS queries and responses have the same basic message format that
includes questions, answers, authority information, and additional information.
Resource records are used to hold most DNS information, and there are many
such types: addresses, mail exchange points, pointers to names, among others.
In the Internet, most DNS messages are carried using UDP/IPv4 and are limited
to 512 bytes in length, but a special extension option (EDNS0) provides for longer
messages and is required to support DNS security (DNSSEC), which we discuss
in detail in Chapter 18.

DNS supports some special features such as zone transfers and dynamic
updates. Zone transfers (complete or incremental) are used to allow redundant
slave servers to synchronize the zone contents with a master server, primarily for
redundancy. Dynamic updates allow zone contents to be modified by an application
using an online protocol. There are really two forms of this capability,
one standardized by [RFC2136] and used in enterprises and a nonstandard but
very popular dynamic DNS capability that allows users assigned temporary IP
addresses (e.g., on cable or DSL) to obtain a DNS entry so that services they provide
can be found by name throughout the world.

DNS has been the subject of numerous attacks, ranging from DoS attacks that
leave the DNS with limited capability, to cache poisoning attacks that can be used
to make malicious servers appear to be legitimate. Various techniques have arisen
to combat this problem, including cryptographic techniques (covered in Chapter 18)
and modifications to DNS servers to be less accepting of unsolicited DNS
responses.


TCP: The Transmission Control Protocol (Preliminaries)


12.1 Introduction 

So far we have been discussing protocols that do not include their own mechanisms
for delivering data reliably. They may detect that erroneous data has been
received, using a mathematical function such as a checksum or CRC, but they do 
not try very hard to repair errors. With IP and UDP, no error repair is done at all. 
With Ethernet and other protocols based on it, the protocol provides some number 
of retries and then gives up if it cannot succeed. 

The problem of communicating in environments where the communication 
medium may lose or alter the messages being delivered has been studied for 
years. Some of the most important theoretical work on the topic was developed 
by Claude Shannon in 1948 [S48]. This work, which popularized the term bit and 
became the foundation of the field of information theory, helps us understand the 
fundamental limits on the amount of information that can be moved across an 
information channel that is lossy (that may delete or alter bits). Information theory 
is closely related to the field of coding theory, which provides ways of encoding 
information so that it is as resilient as possible to errors in the communications 
channel. Using error-correcting codes (basically, adding redundant bits so that the 
real information can be retrieved even if some bits are damaged) to correct communications
problems is one very important method for handling errors. Another
is to simply "try sending again" until the information is finally received. This
approach, called Automatic Repeat Request (ARQ), forms the basis for many communications
protocols, including TCP.

12.1.1 ARQ and Retransmission

If we consider not only a single communication channel but the multihop cascade 
of several, we realize that not only may we have the types of errors mentioned so 
far (packet bit errors), but there may be others. These problems might arise at an 
intermediate router and are the types of problems we brought up when discussing 
IP: packet reordering, packet duplication, and packet erasures (drops). An error- 
correcting protocol designed for use over a multihop communications channel 
(such as IP) must cope with all of these problems. Let us now explore the protocol 
mechanisms that can be brought to bear on them. After we discuss these in the 
abstract, we shall explore how they are used by TCP in the Internet. 

A straightforward method of dealing with packet drops (and bit errors) is to 
resend the packet until it is received properly. This requires a way to determine (1) 
whether the receiver has received the packet and (2) whether the packet it received 
was the same one the sender sent. The method for a receiver to signal to a sender 
that it has received a packet is called an acknowledgment, or ACK. In its most basic 
form, the sender sends a packet and awaits an ACK. When the receiver receives 
the packet, it sends the ACK. When the sender receives the ACK, it sends another 
packet, and the process continues. Interesting questions to ask here are (1) How 
long should the sender wait for an ACK? (2) What if the ACK is lost? (3) What if 
the packet was received but had errors in it? 

As we shall see, the first question turns out to be deep. Deciding how long to 
wait relates to how long the sender should expect to wait for an ACK. Determining
this may be difficult; we postpone the discussion of techniques for it until we
discuss TCP in detail later (see Chapter 14). The answer to question 2 is easier: 
if an ACK is dropped, the sender cannot readily distinguish this case from the 
case in which the original packet is dropped, so it simply sends the packet again. 
Of course, the receiver may receive two or more copies in that case, so it must be 
prepared to handle that situation (see the next paragraph). As for the third question,
we can appeal to the codes mentioned in Section 12.1. It is generally much
easier to use codes to detect errors in a large packet (with high probability) using 
only a few bits than it is to correct them. Simpler codes are typically not capable 
of correcting errors but are capable of detecting them. That is why checksums and 
CRCs are so popular. In order to detect errors in a packet, then, we use a form of 
checksum. When a receiver receives a packet containing an error, it refrains from 
sending an ACK. Eventually, the sender resends the packet, which ideally arrives 
undamaged. 

Even with the simple scenario presented so far, there is the possibility that 
the receiver might receive duplicate copies of the packet being transferred. This 
problem is addressed using a sequence number. Basically, every unique packet gets 
a new sequence number when it is sent at the source, and this sequence number is 
carried along in the packet itself. The receiver can use this number to determine 
whether it has already seen the packet and if so, discard it. 
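
The basic protocol described so far (send, wait for an ACK, resend on timeout,
discard duplicates by sequence number) can be sketched as a small simulation.
This is a toy model under stated assumptions: losses are scripted rather than
random, a timeout is modeled simply as "try again," and the function name
stop_and_wait and its parameters are hypothetical:

```python
def stop_and_wait(packets, drop_data=(), drop_acks=()):
    """Simulate a send-and-wait ARQ exchange over a channel with scripted losses.

    drop_data and drop_acks hold the 0-based transmission attempts at which the
    data packet or its ACK is lost. Returns (delivered, transmissions).
    """
    delivered = []       # what the receiver hands to its application
    expected_seq = 0     # next sequence number the receiver has not yet seen
    attempt = 0          # counts every data transmission, including resends
    for seq, payload in enumerate(packets):
        acked = False
        while not acked:
            this_attempt = attempt
            attempt += 1
            if this_attempt in drop_data:
                continue  # packet lost: the sender times out and resends
            # Receiver side: deliver new data, discard duplicates.
            if seq == expected_seq:
                delivered.append(payload)
                expected_seq += 1
            # The receiver ACKs duplicates too, so the sender can move on.
            if this_attempt in drop_acks:
                continue  # ACK lost: the sender times out and resends
            acked = True
    return delivered, attempt


delivered, transmissions = stop_and_wait(["a", "b", "c"], drop_data={1}, drop_acks={3})
print(delivered, transmissions)  # prints ['a', 'b', 'c'] 5
```

In this run the second packet is lost once and the ACK for the third packet is
lost once, so five transmissions are needed to deliver three packets, and the
duplicate copy of the third packet is discarded at the receiver.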

The protocol described so far is reliable but not very efficient. Consider what
happens when the time to deliver even a small packet from sender to receiver (the
delay or latency) is large (e.g., a second or two, which is not unusual for satellite
links) and there are several packets to send. The sender is able to inject a single
packet into the communications path but then must stop until it hears the ACK.
This protocol is therefore called "stop and wait." Its throughput performance (data
sent on the network per unit time) is proportional to M/R where M is the packet
size and R is the round-trip time (RTT), assuming no packets are lost or irreparably
damaged in transit. For a fixed-size packet, as R goes up, the throughput goes
down. If packets are lost or damaged, the situation is even worse: the "goodput"
(useful amount of data transferred per unit time) can be considerably less than the
throughput.
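
To get a feel for the M/R relationship, the following sketch treats the
throughput as exactly one packet per RTT, ignoring the time needed to place the
packet on the link. The packet size and RTT values are illustrative assumptions,
not measurements:

```python
def stop_and_wait_throughput(packet_bits: float, rtt_seconds: float) -> float:
    """Approximate stop-and-wait throughput in bits/s: one packet per RTT."""
    return packet_bits / rtt_seconds


packet_bits = 1500 * 8  # assumed 1500-byte packet
for rtt in (0.5, 1.0, 2.0):
    print(rtt, stop_and_wait_throughput(packet_bits, rtt))
```

Doubling the RTT halves the throughput, regardless of how fast the underlying
links are.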

For a network that doesn't damage or drop many packets, the cause for low
throughput is usually that the network is not being kept busy. The situation is
similar to using an assembly line where new work cannot enter the line until a
complete product emerges. Most of the line goes idle. If we take this comparison
one step further, it seems obvious that we would do better if we could have more
than one work unit in the line at a time. It is the same for network communication:
if we could have more than one packet in the network, we would keep it
"more busy," leading to higher throughput.

Allowing more than one packet to be in the network at a time complicates
matters considerably. Now the sender must decide not only when to inject a packet
into the network, but also how many. It also must figure out how to keep the
timers when waiting for ACKs, and it must keep a copy of each packet not yet
acknowledged in case retransmissions are necessary. The receiver needs to have
a more sophisticated ACK mechanism: one that can distinguish which packets
have been received and which have not. The receiver may need a more sophisticated
buffering (packet storage) mechanism, one that allows it to hold "out-of-sequence"
packets (those packets that have arrived earlier than those expected
because of loss or reordering), unless it simply wants to throw away such packets,
which is very inefficient. There are other issues that may not be so obvious.
What if the receiver is slower than the sender? If the sender simply injects many
packets at a very high rate, the receiver might just drop them because of processing
or memory limitations. The same question can be asked about the routers in
the middle. What if the network infrastructure cannot handle the rate of data the
sender and receiver wish to use?

12.1.2 Windows of Packets and Sliding Windows 

To handle all of these problems, we begin with the assumption that each unique
packet has a sequence number, as described earlier. We define a window of packets
as the collection of packets (or their sequence numbers) that have been injected by
the sender but not yet completely acknowledged (i.e., the sender has not received
an ACK for them). We refer to the window size as the number of packets in the
window. The term window comes from the idea that if you lined up all the packets
sent during a communication session in a long row but had only a small aperture
through which to view them, you would see only a subset of them, like peering
through a window. The sender's window (and the line of other packets) can be
graphically depicted as shown in Figure 12-1.


[Figure 12-1 image not reproduced; surviving labels: Left Window Edge, Right Window Edge]

Figure 12-1 The sender's window, showing which packets are eligible to be sent (or have already
been sent), which are not yet eligible, and which have already been sent and acknowledged.
In this example, the window size is fixed at three packets.


This figure shows the current window of three packets, for a total window
size of 3. Packet number 3 has already been sent and acknowledged, so the copy
of it that the sender was keeping can now be released. Packet 7 is ready at the
sender but not yet able to be sent because it is not yet "in" the window. If we now
imagine that data starts to flow from the sender to the receiver and ACKs start to
flow in the reverse direction, the sender might next receive an ACK for packet 4.
When this happens, the window "slides" to the right by one packet, meaning that
the copy of packet 4 can be released and packet 7 can be sent. This movement of
the window gives rise to another name for this type of protocol, a sliding window
protocol.
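
The window bookkeeping in this example can be sketched as a small class. This
is a minimal model with a fixed window size and no timers or retransmissions;
the class name SlidingWindowSender and its methods are hypothetical:

```python
class SlidingWindowSender:
    """Fixed-size sliding window over packet numbers (no timers, no resends)."""

    def __init__(self, first_seq: int, window_size: int):
        self.left = first_seq          # left window edge: oldest unACKed packet
        self.window_size = window_size
        self.next_to_send = first_seq  # next packet not yet injected

    def sendable(self):
        """Packets inside the window that have not been sent yet."""
        right = self.left + self.window_size  # first packet past the right edge
        return list(range(self.next_to_send, right))

    def send_all(self):
        """Inject every packet the window currently allows."""
        sent = self.sendable()
        self.next_to_send = self.left + self.window_size
        return sent

    def on_ack(self, seq: int):
        """An ACK for the packet at the left edge slides the window right by one."""
        if seq == self.left:
            self.left += 1


sender = SlidingWindowSender(first_seq=4, window_size=3)
print(sender.send_all())   # prints [4, 5, 6]
print(sender.sendable())   # prints []
sender.on_ack(4)
print(sender.sendable())   # prints [7]
```

As in the text, packet 7 becomes eligible only after the ACK for packet 4 moves
the left edge of the window.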

The sliding window approach can be used to combat many of the problems
described so far. Typically, this window structure is kept at both the sender and
the receiver. At the sender, it keeps track of what packets can be released, what
packets are awaiting ACKs, and what packets cannot yet be sent. At the receiver, it
keeps track of what packets have already been received and acknowledged, what
packets are expected (and how much memory has been allocated to hold them),
and which packets, even if received, will not be kept because of limited memory.
Although the window structure is convenient for keeping track of data as it flows
between sender and receiver, it does not provide guidance as to how large the
window should be, or what happens if the receiver or network cannot handle the
sender's data rate. We shall now see how these are related.

12.1.3 Variable Windows: Flow Control and Congestion Control

To handle the problem that arises when a receiver is too slow relative to a sender,
we introduce a way to force the sender to slow down when the receiver cannot
keep up. This is called flow control and is usually handled in one of two ways. One
way, called rate-based flow control, gives the sender a certain data rate allocation
and ensures that data is never allowed to be sent at a rate that exceeds the allocation.
This type of flow control is most appropriate for streaming applications and
can be used with broadcast and multicast delivery (see Chapter 9).

The other predominant form of flow control is called window-based flow control
and is the most popular approach when sliding windows are being used. In
this approach, the window size is not fixed but is instead allowed to vary over
time. To achieve flow control using this technique, there must be a method for the
receiver to signal the sender how large a window to use. This is typically called a
window advertisement, or simply a window update. This value is used by the sender
(i.e., the receiver of the window advertisement) to adjust its window size. Logically,
a window update is separate from the ACKs we discussed previously, but
in practice the window update and ACK are carried in a single packet, meaning
that the sender tends to adjust the size of its window at the same time it slides it
to the right.

If we consider the effect of changing the window size at the sender, it becomes
clear how this achieves flow control. The sender is allowed to inject W packets
into the network before it hears an ACK for any of them. If the sender and receiver
are sufficiently fast, and the network loses no packets and has an infinite capacity,
this means that the transfer rate is proportional to (SW/R) bits/s, where W is
the window size, S is the packet size in bits, and R is the RTT. When the window
advertisement from the receiver clamps the value of W at the sender, the sender's
overall rate can be limited so as to not overwhelm the receiver. This approach
works fine for protecting the receiver, but what about the network in between? We
may have routers with limited memory between the sender and the receiver that
have to contend with slow network links. When this happens, it is possible for the
sender's rate to exceed a router's ability to keep up, leading to packet loss. This is
addressed with a special form of flow control called congestion control.
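
The effect of clamping W can be made concrete with a small calculation. The
sketch below takes the transfer rate to be exactly SW/R under the idealized
conditions just described; the function names and the numeric values are
illustrative assumptions:

```python
def window_limited_rate(window_packets: int, packet_bits: int, rtt_seconds: float) -> float:
    """Idealized transfer rate S*W/R in bits/s for a window of W packets."""
    return window_packets * packet_bits / rtt_seconds


def clamp_window(sender_window: int, advertised_window: int) -> int:
    """The sender may not use a window larger than the receiver advertises."""
    return min(sender_window, advertised_window)


S = 12000   # assumed packet size in bits (1500 bytes)
R = 0.5     # assumed RTT in seconds
unclamped = window_limited_rate(8, S, R)
clamped = window_limited_rate(clamp_window(8, advertised_window=4), S, R)
print(unclamped, clamped)  # prints 192000.0 96000.0
```

Halving the advertised window halves the sender's rate without the receiver
ever having to drop a packet.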

Congestion control involves the sender slowing down so as to not overwhelm
the network between itself and the receiver. Recall that in our discussion of flow
control, we used a window advertisement to signal the sender to slow down for the
receiver. This is called explicit signaling, because there is a protocol field specifically
used to inform the sender about what is happening. Another option might be
for the sender to guess that it needs to slow down. Such an approach would involve
implicit signaling; that is, it would involve deciding to slow down based on some
other evidence.

The problem of congestion control in datagram-style networks, and more generally
queuing theory to which it is closely related, has remained a major research
topic for years, and it is unlikely to ever be solved completely for all circumstances.
It is also not practical to discuss all the options and methods of performing flow
control here. The interested reader is referred to [J90], [K97], and [K75]. In Chapter
16 we will explore the particular congestion control technique used with TCP in
more detail, along with a number of variants that have arisen over the years.

12.1.4 Setting the Retransmission Timeout 

One of the most important performance issues the designer of a retransmission-based
reliable protocol faces is how long to wait before concluding that a packet
has been lost and should be resent. Stated another way, What should the retransmission
timeout be? Intuitively, the amount of time the sender should wait before
resending a packet is about the sum of the following times: the time to send the
packet, the time for the receiver to process it and send an ACK, the time for the
ACK to travel back to the sender, and the time for the sender to process the ACK.
Unfortunately, in practice, none of these times are known with certainty. To make
matters worse, any or all of them vary over time as additional load is added to or
removed from the end hosts or routers.

Because it is not practical for the user to tell the protocol implementation what
the values of all the times are (or to keep them up-to-date) for all circumstances, a
better strategy is to have the protocol implementation try to estimate them. This is
called round-trip-time estimation and is a statistical process. Basically, the true RTT
is likely to be close to the sample mean of a collection of samples of RTTs. Note that
this average naturally changes over time (it is not stationary), as the paths taken
through the network may change.

Once some estimate of the RTT is made, the question of setting the actual
timeout value, used to trigger retransmissions, remains. If we recall the definition
of a mean, it can never be the extreme value of a set of samples (unless they
are all the same). So, it would not be sensible to set the retransmission timer to be
exactly equal to the mean estimator, as it is likely that many actual RTTs will be
larger, thereby inducing unwanted retransmissions. Clearly, the timeout should
be set to something larger than the mean, but exactly what this relationship is (or
even if the mean should be directly used) is not yet clear. Setting the timeout too
large is also undesirable, as this leads back to letting the network go idle, reducing
throughput. We shall defer further exploration of this topic to Chapter 14, where
we explore how TCP, in particular, approaches this problem.
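
The reason a timeout equal to the mean is a poor choice can be seen with a few
made-up samples. The RTT values below (in milliseconds) are invented for
illustration, and doubling the mean is only an arbitrary stand-in for "something
larger than the mean"; it is not the method TCP uses:

```python
from statistics import mean


def fraction_exceeding(samples, timeout):
    """Fraction of RTT samples that would trigger a needless retransmission."""
    return sum(1 for s in samples if s > timeout) / len(samples)


rtt_samples_ms = [80, 100, 120, 140, 160]   # assumed RTT measurements
estimate = mean(rtt_samples_ms)             # the sample mean, 120 ms

print(fraction_exceeding(rtt_samples_ms, estimate))      # prints 0.4
print(fraction_exceeding(rtt_samples_ms, 2 * estimate))  # prints 0.0
```

With the timeout set to the mean, two of the five exchanges would have been
retransmitted unnecessarily; a larger timeout avoids this at the cost of
reacting more slowly to a real loss.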


12.2 Introduction to TCP 

Given the background we now have regarding the issues affecting reliable delivery
in general, let us see how they play out in TCP and what type of service it
provides to Internet applications. We also look at the fields in the TCP header,
noticing how many of the concepts we have seen so far (e.g., ACKs, window
advertisements) are captured in the header description. In the chapters that follow, we
examine all of these header fields in more detail.

Our description of TCP starts in this chapter and continues in the next five
chapters. Chapter 13 describes how a TCP connection is established and terminated.
Chapter 14 details how TCP estimates the per-connection RTT and how
the retransmission timeout is set based on this estimate. Chapter 15 looks at the
normal transfer of data, starting with "interactive" applications (such as chat). It
then covers window management and flow control, which apply to both interactive
and "bulk" data flow applications (such as file transfer), along with TCP's
urgent mechanism, which allows a sender to mark certain data in the data stream
as special. Chapter 16 takes a look at congestion control algorithms in TCP that
help to reduce packet loss when the network is very busy. It also discusses some
modifications that have been proposed to increase throughput on fast networks
or improve resiliency on lossy (e.g., wireless) networks. Finally, Chapter 17 shows
how TCP keeps connections active even when no data is flowing.

The original specification for TCP is [RFC0793], although some errors in that RFC
are corrected in the Host Requirements RFC, [RFC1122]. Since then, specifications
for TCP have been revised and extended to include clarified and improved congestion
control behavior [RFC5681][RFC3782][RFC3517][RFC3390][RFC3168], retransmission
timeouts [RFC6298][RFC5682][RFC4015], operation with NATs [RFC5382],
acknowledgment behavior [RFC2883], security [RFC6056][RFC5927][RFC5926], connection
management [RFC5482], and urgent mechanism implementation guidelines
[RFC6093]. There have also been a rich variety of experimental modifications covering
retransmission behaviors [RFC5827][RFC3708], congestion detection and control
[RFC5690][RFC5562][RFC4782][RFC3649][RFC2861], and other features. Finally,
there is an effort to explore how TCP might take advantage of multiple simultaneous
network-layer paths [RFC6182].

12.2.1 The TCP Service Model 

Even though TCP and UDP use the same network layer (IPv4 or IPv6), TCP provides
a totally different service to the application layer from what UDP does. TCP
provides a connection-oriented, reliable, byte stream service. The term connection-oriented
means that the two applications using TCP must establish a TCP connection
by contacting each other before they can exchange data. The typical analogy
is dialing a telephone number, waiting for the other party to answer the phone
and saying "Hello," and then saying "Who's calling?" There are exactly two endpoints
communicating with each other on a TCP connection; concepts such as
broadcasting and multicasting (see Chapter 9) are not applicable to TCP.

TCP provides a byte stream abstraction to applications that use it. The consequence
of this design decision is that no record markers or message boundaries
are automatically inserted by TCP (see Chapter 1). A record marker corresponds
to an indication of an application's write extent. If the application on one end
writes 10 bytes, followed by a write of 20 bytes, followed by a write of 50 bytes, the
application at the other end of the connection cannot tell what size the individual
writes were. For example, the other end may read the 80 bytes in four reads of 20
bytes at a time or in some other way. One end puts a stream of bytes into TCP, and
the identical stream of bytes appears at the other end. Each endpoint individually
chooses its read and write sizes.
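
The loss of write boundaries can be demonstrated without a network at all. The
ByteStream class below is a hypothetical in-memory stand-in for one direction of
a TCP connection; it models only the byte stream abstraction, not TCP itself:

```python
class ByteStream:
    """In-memory stand-in for one direction of a connection (no network)."""

    def __init__(self):
        self._buffer = bytearray()

    def write(self, data: bytes) -> None:
        self._buffer += data  # the size of each write is not recorded anywhere

    def read(self, max_bytes: int) -> bytes:
        chunk = bytes(self._buffer[:max_bytes])
        del self._buffer[:max_bytes]
        return chunk


stream = ByteStream()
for size in (10, 20, 50):          # three writes, as in the example above
    stream.write(b"x" * size)

reads = [len(stream.read(20)) for _ in range(4)]
print(reads)  # prints [20, 20, 20, 20]
```

The reader sees 80 bytes and nothing that reveals they were written as 10, 20,
and 50. An application that needs message boundaries must encode them itself,
for example with a length prefix.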

TCP does not interpret the contents of the bytes in the byte stream at all. It has
no idea if the data bytes being exchanged are binary data, ASCII characters, EBCDIC
characters, or something else. The interpretation of this byte stream is up to the
applications on each end of the connection. TCP does, however, support the urgent
mechanism mentioned before, although it is no longer recommended for use.

12.2.2 Reliability in TCP 

TCP provides reliability using specific variations on the techniques just described.
Because it provides a byte stream interface, TCP must convert a sending application's
stream of bytes into a set of packets that IP can carry. This is called packetization.
These packets contain sequence numbers, which in TCP actually represent
the byte offsets of the first byte in each packet in the overall data stream rather
than packet numbers. This allows packets to be of variable size during a transfer
and may also allow them to be combined, called repacketization. The application
data is broken into what TCP considers the best-size chunks to send, typically
fitting each segment into a single IP-layer datagram that will not be fragmented.
This is different from UDP, where each write by the application usually generates
a UDP datagram of that size (plus headers). The chunk passed by TCP to IP
is called a segment (see Figure 12-2). In Chapter 15 we shall see how TCP decides
what size a segment should be.
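
Byte-offset sequence numbering can be sketched as follows. The function name
packetize, the initial sequence number, and the segment size are illustrative
assumptions; the sketch also ignores the fact that real TCP sequence numbers are
32 bits wide and wrap around:

```python
def packetize(data: bytes, initial_seq: int, max_segment: int):
    """Split a byte stream into (sequence_number, payload) segments.

    Each sequence number is the stream offset of the payload's first byte,
    shifted by the initial sequence number.
    """
    segments = []
    for offset in range(0, len(data), max_segment):
        payload = data[offset:offset + max_segment]
        segments.append((initial_seq + offset, payload))
    return segments


for seq, payload in packetize(b"a" * 80, initial_seq=1000, max_segment=30):
    print(seq, len(payload))
```

With these values the 80 bytes become segments starting at 1000, 1030, and 1060
with lengths 30, 30, and 20. Because the numbers count bytes rather than
packets, the same bytes could later be resent in differently sized segments
without renumbering anything.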

TCP maintains a mandatory checksum on its header, any associated application
data, and fields from the IP header. This is an end-to-end pseudo-header
checksum whose purpose is to detect any bit errors introduced in transit. If a
segment arrives with an invalid checksum, TCP discards it without sending any
acknowledgment for the discarded packet. The receiving TCP might acknowledge
a previous (already acknowledged) segment, however, to help the sender with its
congestion control computations (see Chapter 16). The TCP checksum uses the
same mathematical function as is used by other Internet protocols (UDP, ICMP,
etc.). For large data transfers, there is some concern that this checksum is not
really strong enough [SP00], so careful applications should apply their own error
protection methods (e.g., stronger checksums or CRCs) or use a middleware layer
to achieve the same result (e.g., see [RFC5044]).
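
The checksum function shared by these protocols is the 16-bit ones' complement
sum commonly called the Internet checksum. The sketch below computes it over an
arbitrary byte string; for TCP the input would also have to include the
pseudo-header fields, which this sketch does not construct. The function name
internet_checksum is our own:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                   # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # big-endian 16-bit word
    while total >> 16:                    # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


sample = bytes.fromhex("0001f203f4f5f6f7")
print(hex(internet_checksum(sample)))  # prints 0x220d
```

A receiver can verify data by running the same sum over the data followed by
the transmitted checksum; the result is zero when no detectable error has
occurred. Because the sum is a simple addition, some multi-bit errors (such as
two 16-bit words being swapped) go undetected, which is one reason for the
concern noted above.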

When TCP sends a group of segments, it normally sets a single retransmission
timer, waiting for the other end to acknowledge reception. TCP does not set a different
retransmission timer for every segment. Rather, it sets a timer when it sends
a window of data and updates the timeout as ACKs arrive. If an acknowledgment
is not received in time, a segment is retransmitted. In Chapter 14 we will look at
TCP's adaptive timeout and retransmission strategy in more detail.

When TCP receives data from the other end of the connection, it sends an
acknowledgment. This acknowledgment may not be sent immediately but is
normally delayed a fraction of a second. The ACKs used by TCP are cumulative in the
sense that an ACK indicating byte number N implies that all bytes up to number N
(but not including it) have already been received successfully. This provides some
robustness against ACK loss: if an ACK is lost, it is very likely that a subsequent
ACK is sufficient to ACK the previous segments.

TCP provides a full-duplex service to the application layer. This means that
data can be flowing in each direction, independent of the other direction.
Therefore, each end of a connection must maintain a sequence number of the data
flowing in each direction. Once a connection is established, every TCP segment that
contains data flowing in one direction of the connection also includes an ACK for
segments flowing in the opposite direction. Each segment also contains a
window advertisement for implementing flow control in the opposite direction. Thus,
when a TCP segment arrives on a connection, the window may slide forward,
the window size may change, and new data may have arrived. As we shall see in
Chapter 13, a fully active TCP connection is bidirectional and symmetric; data can
flow equally well in either direction.

Using sequence numbers, a receiving TCP discards duplicate segments and
reorders segments that arrive out of order. Recall that any of these anomalies
can happen because TCP uses IP to deliver its segments, and IP does not provide
duplicate elimination or guarantee correct ordering. Because it is a byte stream
protocol, however, TCP never delivers data to the receiving application out of order.
Thus, the receiving TCP may be forced to hold on to data with larger sequence
numbers before giving it to an application until a missing lower-sequence-numbered
segment (a "hole") is filled in.
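
The receiver behavior described above (cumulative ACKs, duplicate discard, and holding data above a hole) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real TCP is written: the class name Receiver is hypothetical, segments are assumed never to overlap partially, and sequence number wraparound is ignored.

```python
class Receiver:
    """Toy in-order receiver: buffers out-of-order segments, returns cumulative ACKs."""

    def __init__(self, next_expected: int):
        self.next_expected = next_expected  # next byte number we want to see
        self.pending = {}                   # seq -> payload held above a hole
        self.delivered = bytearray()        # bytes handed to the application

    def on_segment(self, seq: int, payload: bytes) -> int:
        if seq + len(payload) <= self.next_expected:
            return self.next_expected       # duplicate: discard, repeat the ACK
        self.pending.setdefault(seq, payload)
        # Deliver everything that is now contiguous with the delivered data.
        while self.next_expected in self.pending:
            data = self.pending.pop(self.next_expected)
            self.delivered += data
            self.next_expected += len(data)
        return self.next_expected           # cumulative ACK number

rx = Receiver(next_expected=1)
acks = [
    rx.on_segment(1, b"a" * 1024),      # bytes 1-1024 arrive in order
    rx.on_segment(2049, b"c" * 1024),   # bytes 2049-3072 arrive early: a hole
    rx.on_segment(1025, b"b" * 1024),   # the hole is filled
    rx.on_segment(1, b"a" * 1024),      # a late duplicate is discarded
]
print(acks)
```

The second ACK repeats 1025 because the cumulative ACK cannot describe data beyond the hole; once the missing segment arrives, the ACK jumps past everything that was being held.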

We will now begin to look at some of the details of TCP. In this chapter we
will only introduce the encapsulation and header structure for TCP. Other details
appear in the next five chapters. TCP can be used with IPv4 or IPv6, and the
pseudo-header checksum it uses (similar to UDP's) is mandatory for use with
either IPv4 or IPv6.


12.3 TCP Header and Encapsulation 

TCP is encapsulated in IP datagrams as shown in Figure 12-2.


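[Figure 12-2 layout: an IP datagram consisting of the IP header (20 bytes for IPv4
with no options; 40 bytes for IPv6) followed by the TCP segment, which is made up of
the TCP header (20 bytes with no options) and the TCP (application) data.
Proto (IPv4) or Next Header (IPv6) = 6.]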


Figure 12-2 The TCP header appears immediately following the IP header or last IPv6 extension
header and is often 20 bytes long (with no TCP options). With options, the TCP header
can be as large as 60 bytes. Common options include Maximum Segment Size,
Timestamps, Window Scaling, and Selective ACKs.

The header itself is considerably more complicated than the header we saw
for UDP in Chapter 10. This is not very surprising, as TCP is a significantly more
complicated protocol that must keep each end of the connection informed
(synchronized) about the current state. It is shown in Figure 12-3.

Figure 12-3 The TCP header. Its normal size is 20 bytes, unless options are present. The Header
Length field gives the size of the header in 32-bit words (minimum value is 5). The
shaded fields (Acknowledgment Number, Window Size, plus ECE and ACK bits) refer to the
data flowing in the opposite direction relative to the sender of this segment.


Each TCP header contains the source and destination port number. These
two values, along with the source and destination IP addresses in the IP header,
uniquely identify each connection. The combination of an IP address and a port
number is sometimes called an endpoint or socket in the TCP literature. The latter
term appeared in [RFC0793] and was ultimately adopted as the name of the
Berkeley-derived programming interface for network communications (now frequently
called "Berkeley sockets"). It is a pair of sockets or endpoints (the 4-tuple
consisting of the client IP address, client port number, server IP address, and server
port number) that uniquely identifies each TCP connection. This fact will become
important when we look at how a TCP server can communicate with multiple
clients (see Chapter 13).
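
To make the role of the 4-tuple concrete, here is a small Python sketch of a lookup table keyed by a pair of endpoints. The table, the type aliases, and the client port numbers are hypothetical values chosen for illustration; only the idea that the full 4-tuple is the key comes from the text.

```python
from typing import Dict, Tuple

Endpoint = Tuple[str, int]            # (IP address, port number)
ConnKey = Tuple[Endpoint, Endpoint]   # (client endpoint, server endpoint)

connections: Dict[ConnKey, str] = {}

server: Endpoint = ("10.0.0.2", 80)
# Three connections to the same server endpoint remain distinct because
# the client address or the client port differs in each 4-tuple.
connections[(("192.168.35.130", 50001), server)] = "connection A"
connections[(("192.168.35.130", 50002), server)] = "connection B"
connections[(("192.168.35.131", 50001), server)] = "connection C"

def lookup(src: Endpoint, dst: Endpoint) -> str:
    """Find the connection an arriving segment belongs to."""
    return connections[(src, dst)]

print(lookup(("192.168.35.130", 50002), server))
```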

The Sequence Number field identifies the byte in the stream of data from the
sending TCP to the receiving TCP that the first byte of data in the containing
segment represents. If we consider the stream of bytes flowing in one direction
between two applications, TCP numbers each byte with a sequence number. This
sequence number is a 32-bit unsigned number that wraps back around to 0 after
reaching 2^32 - 1. Because every byte exchanged is numbered, the Acknowledgment
Number field (also called the ACK Number or ACK field for short) contains the next
sequence number that the sender of the acknowledgment expects to receive. This
is therefore the sequence number of the last successfully received byte of data plus
1. This field is valid only if the ACK bit field (described later in this section) is on,
which it usually is for all but initial and closing segments. Sending an ACK costs
nothing more than sending any other TCP segment because the 32-bit ACK Number
field is always part of the header, as is the ACK bit field.
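
Because sequence numbers wrap, arithmetic on them is done modulo 2^32. The Python sketch below shows the ACK number for a segment that straddles the wrap point. The helper names are hypothetical, and the circular comparison in seq_lt is a common technique assumed here for illustration; the text above does not prescribe it.

```python
MOD = 2**32

def seq_add(seq: int, n: int) -> int:
    """Advance a sequence number by n bytes, wrapping after 2^32 - 1."""
    return (seq + n) % MOD

def ack_for(seq: int, payload_len: int) -> int:
    """ACK number for a fully received segment: last byte received plus 1."""
    return seq_add(seq, payload_len)

def seq_lt(a: int, b: int) -> bool:
    """True if a comes before b when the number space is treated as a circle."""
    return a != b and (b - a) % MOD < 2**31

# A 1000-byte segment starting 296 bytes before the wrap point.
start = 2**32 - 296
print(ack_for(start, 1000))           # wraps around to a small number
print(seq_lt(start, ack_for(start, 1000)))
```

A plain integer comparison would claim that the ACK number is smaller than the segment's own sequence number here, which is why comparisons also need to account for the wrap.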

When a new connection is being established, the SYN bit field is turned on in
the first segment sent from client to server. Such segments are called SYN segments,
or simply SYNs. The Sequence Number field then contains the first sequence number
to be used on that direction of the connection for subsequent sequence numbers
and in returning ACK numbers (recall that connections are all bidirectional). Note
that this number is not 0 or 1 but instead is another number, often randomly
chosen, called the initial sequence number (ISN). Avoiding 0 or 1 for the ISN is a
security measure and will be discussed in Chapter 13. The sequence number
of the first byte of data sent on this direction of the connection is the ISN plus 1
because the SYN bit field consumes one sequence number. As we shall see later,
consuming a sequence number also implies reliable delivery using retransmission.
Thus, SYNs and application bytes (and FINs, which we will see later) are reliably
delivered. ACKs, which do not consume sequence numbers, are not.

TCP can be described as "a sliding window protocol with cumulative positive 
acknowledgments." The ACK Number field is constructed to indicate the largest 
byte received in order at the receiver (plus 1). For example, if bytes 1-1024 are 
received OK, and the next segment contains bytes 2049-3072, the receiver cannot 
use the regular ACK Number field to signal the sender that it received this new 
segment. Modern TCPs, however, have a selective acknowledgment (SACK) option 
that allows the receiver to indicate to the sender out-of-order data it has received 
correctly. When paired with a TCP sender capable of selective repeat, a significant 
performance benefit may be realized [FF96]. In Chapter 14 we will see how TCP 
uses duplicate acknowledgments (multiple segments with the same ACK field) to 
help with its congestion control and error control procedures. 

The Header Length field gives the length of the header in 32-bit words. This is 
required because the length of the Options field is variable. With a 4-bit field, TCP 
is limited to a 60-byte header. Without options, however, the size is 20 bytes. 

Currently eight bit fields are defined for the TCP header, although some older
implementations understand only the last six of them. (Note that [RFC3540], an
experimental RFC, also defines the least significant of the Resv bits as a
nonce sum (NS); see Section 16.12.) One or more of them can
be turned on at the same time. We briefly mention their use here and discuss each
of them in more detail in later chapters.

1. CWR: Congestion Window Reduced (the sender reduced its sending rate);
see Chapter 16.

2. ECE: ECN Echo (the sender received an earlier congestion notification);
see Chapter 16.

3. URG: Urgent (the Urgent Pointer field is valid; rarely used); see Chapter 15.

4. ACK: Acknowledgment (the Acknowledgment Number field is valid;
always on after a connection is established); see Chapters 13 and 15.

5. PSH: Push (the receiver should pass this data to the application as soon as
possible; not reliably implemented or used); see Chapter 15.

6. RST: Reset the connection (connection abort, usually because of an error);
see Chapter 13.

7. SYN: Synchronize sequence numbers to initiate a connection; see Chapter 13.

8. FIN: The sender of the segment is finished sending data to its peer; see
Chapter 13.
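
The header fields described so far can be decoded with a short Python sketch. It assumes the standard layout of the fixed 20-byte header and the bit order of the eight flags as listed above (CWR is the most significant bit, FIN the least). The function name parse_tcp_header and the sample SYN (source port 50000, zero-filled options) are hypothetical, and the reserved bits next to the Header Length field are simply ignored.

```python
import struct

# Flag names from most significant bit (0x80) to least significant bit (0x01).
FLAG_NAMES = ["CWR", "ECE", "URG", "ACK", "PSH", "RST", "SYN", "FIN"]

def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header and locate the options."""
    (src_port, dst_port, seq, ack, offset_byte, flag_byte,
     window, checksum, urgent) = struct.unpack("!HHIIBBHHH", segment[:20])
    header_len = (offset_byte >> 4) * 4   # Header Length counts 32-bit words
    if not 20 <= header_len <= 60:
        raise ValueError("Header Length must be between 5 and 15 words")
    flags = [name for bit, name in zip(range(7, -1, -1), FLAG_NAMES)
             if flag_byte & (1 << bit)]
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack,
        "header_len": header_len, "flags": flags,
        "window": window, "checksum": checksum, "urgent": urgent,
        "options": segment[20:header_len],
        "data": segment[header_len:],
    }

# Hypothetical SYN with a 44-byte header (11 words): 24 bytes of options,
# zero-filled here because option encoding is outside this sketch.
syn = struct.pack("!HHIIBBHHH", 50000, 80, 685506836, 0,
                  11 << 4, 0x02, 65535, 0, 0) + bytes(24)
fields = parse_tcp_header(syn)
print(fields["header_len"], fields["flags"], len(fields["options"]))
```

The 4-bit Header Length field can hold at most 15, which is where the 60-byte header limit (15 words of 4 bytes) comes from.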

TCP's flow control is provided by each end advertising a window size using
the Window Size field. This is the number of bytes, starting with the one specified
by the ACK number, that the receiver is willing to accept. This is a 16-bit field,
limiting the window to 65,535 bytes, and thereby limiting TCP's throughput
performance. In Chapter 15 we will look at the Window Scale option that allows this
value to be scaled, providing much larger windows and improved performance
for high-speed and long-delay networks.
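
A back-of-the-envelope calculation shows why a 65,535-byte window limits throughput: a sender can have at most one window of unacknowledged data outstanding per round trip, so throughput is bounded by roughly the window size divided by the round-trip time. The Python sketch below works this out; the 100 ms round-trip time and the scale shift of 7 are assumptions chosen for illustration.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput: one window of data per round-trip time."""
    return window_bytes * 8 / rtt_seconds

RTT = 0.100  # assumed round-trip time of 100 ms

unscaled = max_throughput_bps(65535, RTT)       # largest unscaled window
scaled = max_throughput_bps(65535 << 7, RTT)    # same field, scaled by 2^7

print(f"unscaled: {unscaled / 1e6:.2f} Mb/s")
print(f"scaled:   {scaled / 1e6:.2f} Mb/s")
```

Under these assumptions the unscaled window caps a connection at about 5.24 Mb/s regardless of link speed, while scaling the same 16-bit value by 2^7 raises the bound to about 671 Mb/s.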

The TCP Checksum field covers the TCP header and data and some fields in
the IP header, using a pseudo-header computation similar to the one used with
ICMPv6 and UDP that we discussed in Chapters 8 and 10. It is mandatory for this
field to be calculated and stored by the sender, and then verified by the receiver.
The TCP checksum is calculated with the same algorithm as the IP, ICMP, and
UDP ("Internet") checksums.
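
The checksum computation can be sketched in Python as follows. This is a minimal illustration for IPv4 only, assuming the usual 12-byte pseudo-header (source address, destination address, a zero byte, protocol number 6, and the TCP length); the function names and the sample segment are hypothetical.

```python
import socket
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def tcp_checksum_ipv4(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header plus the TCP header and data."""
    pseudo = struct.pack("!4s4sBBH",
                         socket.inet_aton(src_ip), socket.inet_aton(dst_ip),
                         0, 6, len(segment))
    return internet_checksum(pseudo + segment)

# Sender: compute over a segment whose Checksum field (bytes 16-17) is zero.
segment = struct.pack("!HHIIBBHHH", 50000, 80, 1001, 0,
                      5 << 4, 0x18, 65535, 0, 0) + b"hello"
value = tcp_checksum_ipv4("192.168.35.130", "10.0.0.2", segment)
filled = segment[:16] + struct.pack("!H", value) + segment[18:]

# Receiver: recomputing over the filled-in segment gives 0 if nothing changed.
print(tcp_checksum_ipv4("192.168.35.130", "10.0.0.2", filled))
```

Flipping any single bit of the filled-in segment makes the receiver's result nonzero, which is the case in which TCP discards the segment.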

The Urgent Pointer field is valid only if the URG bit field is set. This "pointer" is
a positive offset that must be added to the Sequence Number field of the segment to
yield the sequence number of the last byte of urgent data. TCP's urgent mechanism
is a way for the sender to provide specially marked data to the other end.

The most common Option field is the Maximum Segment Size option, called
the MSS. Each end of a connection normally specifies this option on the first
segment it sends (the ones with the SYN bit field set to establish the connection). The
MSS option specifies the maximum-size segment that the sender of the option is
willing to receive in the reverse direction. We describe the MSS option in more
detail in Chapter 13 and some of the other TCP options in Chapters 14 and 15. Other
common options we investigate include SACK, Timestamp, and Window Scale.

In Figure 12-2 we note that the data portion of the TCP segment is optional.
We will see in Chapter 13 that when a connection is established, and when a
connection is terminated, segments are exchanged that contain only the TCP header
(with or without options) but no data. A header without any data is also used
to acknowledge received data, if there is no data to be transmitted in that
direction (called a pure ACK), and to notify the communication peer of a change in the
window size (called a window update). There are also some cases resulting from
timeouts when a segment can be sent without any data.


12.4 Summary

The problem of providing reliable communications over lossy communication 
channels has been studied for years. The two primary methods for dealing with 
errors include error-correcting codes and data retransmission. The protocols using 
retransmissions must also handle data loss, usually by setting a timer, and must 
also arrange some way for the receiver to signal the sender what it has received. 
Deciding how long to wait for an ACK can be tricky, as the appropriate time may 
change as network routing or load on the end systems varies. Modern protocols 
estimate the round-trip time and set the retransmission timer based on some 
function of these measurements. 

Except for setting the retransmission timer, retransmission protocols are
simple when only one packet may be in the network at one time, but they perform
poorly for networks where the delay is high. To be more efficient, multiple packets
must be injected into the network before an ACK is received. This approach is more
efficient but also more complex. A typical approach to managing the complexity is
to use sliding windows, whereby packets are marked with sequence numbers, and
the window size bounds the number of such packets. When the window size
varies based on either feedback from the receiver or other signals (such as dropped
packets), both flow control and congestion control can be achieved.

TCP provides a reliable, connection-oriented, byte stream, transport-layer
service built using many of these techniques. We looked briefly at all of the fields
in the TCP header, noting that most of them are directly related to these abstract
concepts in reliable delivery. We will examine them in detail in the chapters that
follow. TCP packetizes the application data into segments, sets a timeout anytime
it sends data, acknowledges data received by the other end, reorders out-of-order
data, discards duplicate data, provides end-to-end flow control, and calculates and
verifies a mandatory end-to-end checksum. It is the most widely used protocol on
the Internet. It is used by most of the popular applications, such as HTTP, SSH/TLS,
NetBIOS (NBT, NetBIOS over TCP), Telnet, FTP, and electronic mail (SMTP).
Many distributed file-sharing applications (e.g., BitTorrent, Shareaza) also use TCP.


13.1 Introduction

TCP is a unicast connection-oriented protocol. Before either end can send data to the
other, a connection must be established between them. In this chapter, we take a
detailed look at what a TCP connection is, how it is established, and how it is
terminated. Recall that TCP's service model is a byte stream. TCP detects and repairs
essentially all the data transfer problems that may be introduced by packet loss,
duplication, or errors at the IP layer (or below).

Because of its management of connection state (information about the
connection kept by both endpoints), TCP is a considerably more complicated protocol
than UDP (see Chapter 10). UDP is a connectionless protocol that involves no
connection establishment or termination. One of the major differences we shall see
between the two is the amount of detail required to handle the various TCP states
properly: when connections are created, terminated normally, and reset without
warning. In other chapters we will look at what happens once the connection is
established and data is transferred.

During connection establishment, several options can be exchanged between
the two endpoints regarding the parameters of the connection. Some options are
allowed to be sent only when the connection is established, and others can be sent
later. Recall from Chapter 12 that the TCP header has a limited space for holding
options (40 bytes).


13.2 TCP Connection Establishment and Termination 

A TCP connection is defined to be a 4-tuple consisting of two IP addresses and two
port numbers. More precisely, it is a pair of endpoints or sockets where each
endpoint is identified by an (IP address, port number) pair.

A connection typically goes through three phases: setup, data transfer (called
established), and teardown (closing). As we will see, some of the difficulty in
creating a robust TCP implementation is handling all of the transitions between and
among these phases correctly. A typical TCP connection establishment and close
(without any data transfer) is shown in Figure 13-1.

Figure 13-1 A normal TCP connection establishment and termination. Usually, the client initiates a three-way
handshake to exchange initial sequence numbers carried on SYN segments for the client and
server (ISN(c) and ISN(s), respectively). The connection terminates after each side has sent a FIN
and received an acknowledgment for it.


The figure shows a timeline of what happens during connection
establishment. To establish a TCP connection, the following events usually take place:

1. The active opener (normally called the client) sends a SYN segment (i.e., a
TCP/IP packet with the SYN bit field turned on in the TCP header)
specifying the port number of the peer to which it wants to connect and the client's
initial sequence number or ISN(c) (see Section 13.2.3). It typically sends one
or more options at this point (see Section 13.3). This is segment 1.

2. The server responds with its own SYN segment containing its initial
sequence number (ISN(s)). This is segment 2. The server also acknowledges
the client's SYN by ACKing ISN(c) plus 1. A SYN consumes one sequence
number and is retransmitted if lost.

3. The client must acknowledge this SYN from the server by ACKing ISN(s)
plus 1. This is segment 3.

These three segments complete the connection establishment. This is often
called the three-way handshake. Its main purposes are to let each end of the
connection know that a connection is starting and the special details that are carried as
options, and to exchange the ISNs.
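
The sequence and ACK numbers in the three segments follow mechanically from the two ISNs, as this Python sketch shows. The dictionary representation and the function name are hypothetical; the two ISN values are the ones that appear in the example trace in Section 13.2.4.

```python
MOD = 2**32

def three_way_handshake(isn_c: int, isn_s: int):
    """Return the three handshake segments as simple dictionaries."""
    # Segment 1: client SYN carrying ISN(c).
    seg1 = {"flags": {"SYN"}, "seq": isn_c}
    # Segment 2: server SYN carrying ISN(s), ACKing ISN(c) + 1.
    seg2 = {"flags": {"SYN", "ACK"}, "seq": isn_s,
            "ack": (seg1["seq"] + 1) % MOD}
    # Segment 3: client ACK of ISN(s) + 1. The client's SYN consumed one
    # sequence number, so this segment is numbered ISN(c) + 1.
    seg3 = {"flags": {"ACK"}, "seq": seg2["ack"],
            "ack": (seg2["seq"] + 1) % MOD}
    return [seg1, seg2, seg3]

for number, seg in enumerate(three_way_handshake(685506836, 1479690171), 1):
    print(number, sorted(seg["flags"]), seg["seq"], seg.get("ack"))
```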

The side that sends the first SYN is said to perform an active open. As
mentioned, this is typically a client. The other side, which receives this SYN and sends
the next SYN, performs a passive open. It is most commonly called the server. (In
Section 13.2.2 we describe a supported but unusual simultaneous open when both
sides can do an active open at the same time and become both clients and servers.)


Note 

TCP supports the capability of carrying application data on SYN segments. This 
is rarely used, however, because the Berkeley sockets API does not support it. 


Figure 13-1 also shows how a TCP connection is closed (also called cleared or
terminated). Either end can initiate a close operation, and simultaneous closes are
also supported but are rare. Traditionally, it was most common for the client to
initiate a close (as shown in Figure 13-1). However, some servers (e.g., Web servers)
initiate a close after they have completed a request. Usually a close operation starts
with an application indicating its desire to terminate its connection (e.g., using the
close() system call). The closing TCP initiates the close operation by sending a
FIN segment (i.e., a TCP segment with the FIN bit field set). The complete close
operation occurs after both sides have completed the close:

1. The active closer sends a FIN segment specifying the current sequence
number the receiver expects to see (K in Figure 13-1). The FIN also includes an
ACK for the last data sent in the other direction (labeled L in Figure 13-1).

2. The passive closer responds by ACKing value K + 1 to indicate its
successful receipt of the active closer's FIN. At this point, the application is
notified that the other end of its connection has performed a close. Typically
this results in the application initiating its own close operation. The passive
closer then effectively becomes another active closer and sends its own FIN.
The sequence number is equal to L.

3. To complete the close, the final segment contains an ACK for the last FIN.

Note that if a FIN is lost, it is retransmitted until an ACK for it is received.

While it takes three segments to establish a connection, it takes four to
terminate one. It is also possible for the connection to be in a half-open state (see Section
13.6.3), although this is not common. The reason is that TCP's data
communications model is bidirectional, meaning it is possible to have only one of the two
directions operating. The half-close operation in TCP closes only a single direction
of the data flow. Two half-close operations together close the entire connection.
The rule is that either end can send a FIN when it is done sending data. When a
TCP receives a FIN, it must notify the application that the other end has
terminated that direction of data flow. The sending of a FIN is normally the result of
the application issuing a close operation, which typically causes both directions
to close.

The seven segments we have seen are baseline overheads for any TCP
connection that is established and cleared "gracefully." (There are more abrupt ways to
tear down a TCP connection using special reset segments, which we cover later.)
When a small amount of data needs to be exchanged, it is now apparent why some
applications prefer to use UDP because of its ability to send and receive data
without establishing connections. However, such applications are then faced with
handling their own error repair features, congestion management, and flow control.

13.2.1 TCP Half-Close 

As we have mentioned, TCP supports a half-close operation. Few applications
require this capability, so it is not common. To use this feature, the API must
provide a way for the application to say, essentially, "I am done sending data, so send
a FIN to the other end, but I still want to receive data from the other end, until it
sends me a FIN." The Berkeley sockets API supports half-close, if the application
calls the shutdown() function instead of calling the more typical close()
function. Most applications, however, terminate both directions of the connection by
calling close. Figure 13-2 shows an example of a half-close being used. We show
the client on the left side initiating the half-close, but either end can do this.

The first two segments are the same as for a regular close: a FIN by the
initiator, followed by an ACK of the FIN by the recipient. The operation then differs
from Figure 13-1, because the side that receives the half-close can still send data.
We show only one data segment, followed by an ACK, but any number of data
segments can be sent. (We talk more about the exchange of data segments and
acknowledgments in Chapter 15.) When the end that received the half-close is
done sending data, it closes its end of the connection, causing a FIN to be sent, and
this delivers an end-of-file indication to the application that initiated the half-close.
When this second FIN is acknowledged, the connection is completely closed.

Figure 13-2 With the TCP half-close operation, one direction of the connection can terminate while the other
continues until it is closed. Few applications use this feature.
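
The half-close exchange can be tried on a single machine with Python's socket module, whose shutdown() method wraps the Berkeley sockets call of the same name. The sketch below is a minimal illustration that assumes the loopback interface is available; both ends run in one process for brevity, and the message contents are arbitrary.

```python
import socket

def recv_until_eof(sock: socket.socket) -> bytes:
    """Read until the peer's FIN is seen (recv returns an empty result)."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))     # port 0: let the OS pick a free port
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server, _ = listener.accept()

client.sendall(b"request")
client.shutdown(socket.SHUT_WR)     # half-close: send a FIN, keep receiving

request = recv_until_eof(server)    # server sees end-of-file after the data
server.sendall(b"reply to " + request)
server.close()                      # second FIN closes the other direction

reply = recv_until_eof(client)      # client can still read after its own FIN
client.close()
listener.close()
print(reply)
```

Calling close() on the client instead of shutdown() would have ended both directions at once, and the reply could not have been read.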


13.2.2 Simultaneous Open and Close 

It is possible, although highly improbable unless specifically arranged, for two
applications to perform an active open to each other at the same time. Each end
must have transmitted a SYN before receiving a SYN from the other side; the
SYNs must pass each other on the network. This scenario also requires each end
to have an IP address and port number that are known to the other end, which is
rare (except for the firewall "hole-punching" techniques we saw in Chapter 7). If
this happens, it is called a simultaneous open.

For example, a simultaneous open occurs when an application on host A using
local port 7777 performs an active open to port 8888 on host B, while at the same
time an application on host B using local port 8888 performs an active open to
port 7777 on host A. This is not the same as connecting a client on host A to a
server on host B, while at the same time having a client on host B connect to a
conventional server on host A. In that case, both servers perform passive opens,
not active opens, and the clients assign themselves different ephemeral port
numbers. This results in two distinct TCP connections. Figure 13-3 shows the segments
exchanged during a simultaneous open.

Figure 13-3 Segments exchanged during simultaneous open. One additional segment is required compared
to the ordinary connection establishment procedure. The SYN bit field is on in each segment
until an ACK for it is received.


A simultaneous open requires the exchange of four segments, one more than
the normal three-way handshake. Also note that we do not call either end a
client or a server, because both ends act as client and server. A simultaneous close is
not very different. We said earlier that one side (often, but not always, the client)
performs the active close, causing the first FIN to be sent. In a simultaneous close,
both do. Figure 13-4 shows the segments exchanged during a simultaneous close.

Figure 13-4 Segments exchanged during simultaneous close work like a conventional close, but the
segment ordering is interleaved.

With a simultaneous close the same number of segments are exchanged as in
the normal close. The only real difference is that the segment sequence is
interleaved instead of sequential. Later we will see that simultaneous open and close
operations use particular states in the TCP implementation that are not commonly
exercised.

13.2.3 Initial Sequence Number (ISN) 

When a connection is open, any segment with the appropriate two IP addresses
and port numbers is accepted as valid provided the sequence number is valid
(i.e., within the window) and the checksum is OK. This brings up the question of
whether it might be possible to have TCP segments being routed through the
network that could show up later and disrupt a connection. This concern is addressed
by careful selection of the ISN, which we now investigate.

Before each end sends its SYN to establish the connection, it chooses an ISN
for that connection. The ISN should change over time, so that each connection
has a different one. [RFC0793] specifies that the ISN should be viewed as a 32-bit
counter that increments by 1 every 4μs. The purpose of doing this is to arrange
for the sequence numbers for segments on one connection to not overlap with
sequence numbers on another (new) identical connection. In particular, new
sequence numbers must not be allowed to overlap between different instantiations
(or incarnations) of the same connection.

The idea of different instantiations of the same connection becomes clear
when we recall that a TCP connection is identified by a pair of endpoints,
creating a 4-tuple of two address/port pairs. If a connection had one of its segments
delayed for a long period of time and closed, but then opened again with the same
4-tuple, it is conceivable that the delayed segment could reenter the new
connection's data stream as valid data. This would be most troublesome. By taking steps
to avoid overlap in sequence numbers between connection instantiations, we can
try to minimize this risk. It does suggest, however, that an application with a very
great need for data integrity should employ its own CRCs or checksums at the
application layer to ensure that its own data has been transferred without error.
This is generally good practice in any case, and it is commonly done for large files.

As we shall see, knowing the connection 4-tuple as well as the currently active
window of sequence numbers is all that is required to form a TCP segment that is
considered valid to a communicating TCP endpoint. This represents a form of
vulnerability for TCP: anyone can forge a TCP segment and, if the sequence numbers,
IP addresses, and port numbers are chosen appropriately, can interrupt a TCP
connection [RFC5961]. One way of repelling this is to make the initial sequence
number (or ephemeral port number [RFC6056]) relatively hard to guess. Another
is encryption (see Chapter 18).

In modern systems, the ISN is typically selected in a semirandom way. An
interesting discussion of the subtleties of doing this properly is contained in CERT
Advisory CA-2001-09 [CERTISN]. Linux goes through a fairly elaborate process to
select its ISNs. It uses a clock-based scheme but starts the clock at a random offset
for each connection. The random offset is chosen as a cryptographically hashed
function on the connection identifier (4-tuple). A secret input to the hash
function changes every 5 minutes. Of the 32 bits in the ISN, the top-most 8 bits are a
sequence number of the secret, and the remaining bits are generated by the hash.
This produces an ISN that is difficult to guess, but also one that increases over
time. Windows reportedly uses a similar scheme based on RC4 [S96].
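
The two ideas in this section, a clock that ticks every 4μs and a per-connection offset derived from a keyed hash, can be combined in a short Python sketch. This is a simplified illustration and not the Linux algorithm: the function names are hypothetical, SHA-256 is an assumed stand-in for whatever hash a real system uses, and the rotation of the secret and the 8-bit secret sequence number are left out.

```python
import hashlib

MOD = 2**32

def clock_isn(now_seconds: float) -> int:
    """[RFC0793]-style ISN: a 32-bit counter that ticks once every 4 microseconds."""
    return int(now_seconds * 1_000_000 / 4) % MOD

def hashed_isn(four_tuple, secret: bytes, now_seconds: float) -> int:
    """Clock-based ISN shifted by a per-connection offset from a keyed hash."""
    material = repr(four_tuple).encode() + secret
    offset = int.from_bytes(hashlib.sha256(material).digest()[:4], "big")
    return (clock_isn(now_seconds) + offset) % MOD

conn = ("192.168.35.130", 50000, "10.0.0.2", 80)   # hypothetical 4-tuple
secret = b"example-secret"                          # hypothetical secret input

first = hashed_isn(conn, secret, now_seconds=0.0)
later = hashed_isn(conn, secret, now_seconds=1.0)
print(first, later, (later - first) % MOD)

# Time for the 4-microsecond counter to wrap, in hours.
print(2**32 * 4e-6 / 3600)
```

For a fixed 4-tuple and secret, the ISN still advances by 250,000 per second, so successive instantiations of the same connection get increasing values, while an observer who does not know the secret cannot easily predict the offset. The plain counter wraps after roughly 4.77 hours.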

13.2.4 Example 

Now that we have a basic idea of how a TCP connection is established and cleared,
let us look at the packet-level details. To do so we make a TCP connection to a
nearby Web server running on the machine with IPv4 address 10.0.0.2. The
client is the Telnet application on Windows:

C:\> telnet 10.0.0.2 80 

Welcome to Microsoft Telnet Client 
Escape Character is 'CTRL+]' 

... wait about 4.4 seconds ... 

Microsoft Telnet> quit 


The telnet command establishes a TCP connection with the host having IPv4
address 10.0.0.2 on the port corresponding to the http or Web service (port 80).
When the Telnet program connects to a port other than 23 (the well-known port
for the Telnet protocol [RFC0854]), it does not engage in the application protocol.
Instead, it merely copies bytes from its input to its TCP connection and vice versa.
When a Web server receives the incoming connection request, the first thing it does
is await a request for a Web page. In this case, we do not provide one, so the server
does not produce any data. This is ideal for us, because for now we are interested
only in the connection establishment and termination packet exchange. Figure 13-5
shows the Wireshark output for the segments generated by this command.

In the figure, we can see that the client begins with a SYN segment
containing an ISN of 685506836 and window advertisement of 65535. This segment also
contains several options we discuss in Section 13.3. The second segment is both
a SYN from the server and an ACK for the client. The sequence number (server's
ISN) is 1479690171 and the ACK number is 685506837, 1 more than the client's ISN.
This indicates successful receipt of the client's ISN. This segment also includes a
window advertisement indicating that the server is willing to accept up to 64,240
bytes. Completion of the three-way handshake takes place with segment 3, which
contains ACK number 1479690172. Remember that ACK numbers are cumulative
and always indicate the sequence number the sender of the ACK expects to see
next (not the one that it last received).

After a pause of about 4.4s, the Telnet application is instructed to close the
connection. This results in the client's TCP sending the FIN in segment 4. The
sequence number of the FIN is 685506837, which is ACKed in segment 5 (with
ACK number 685506838). Shortly thereafter the server sends its own FIN with
sequence number 1479690172. This segment also (redundantly) ACKs the client's
FIN once again. Note that the PSH bit field is on. This has no real effect on the
closing of the connection but usually indicates that the server has no additional
data to send. The final segment ACKs the server's FIN by including ACK number
1479690173.

Figure 13-5 A TCP connection between 192.168.35.130 and 10.0.0.2 is established and cleared without sending
any data. The PSH (Push) bit indicates that segment 6 is sending all data from its buffer (which
is none).


Note 

[RFC1025] calls a segment with the maximum number of features enabled (e.g., 
flags and options) a “Kamikaze” packet. Other colorful terms include “nastygram,”
“Christmas tree packet,” and “lamp test segment.” 

One thing we can see in Figure 13-5 is that the SYN segments contain one or more 
options. These take up additional space in the TCP header. For example, the length 
of the first TCP header is 44 bytes, 24 bytes greater than the minimum size. TCP
has several supported options, which we detail after we see what happens when a
connection cannot be established. 


13.2.5 Timeout of Connection Establishment

There are several circumstances in which a connection cannot be established. One
obvious case is when the server host is down. To simulate this scenario, we issue
our telnet command to a nonexistent host in the same subnet. If we do this
without modifying the ARP table, the client exits with a "No route to host" error
message, generated because no ARP reply is ever returned for the ARP request
(see Chapter 4). If, however, we place an ARP entry for a nonexistent host in the
ARP table first, the ARP request is not sent, and the system immediately attempts
to contact the nonexistent host with TCP/IP. First, the commands:

Linux# arp -s 192.168.10.180 00:00:1a:1b:1c:1d
Linux% date; telnet 192.168.10.180 80; date
Tue June 7 21:16:34 PDT 2009
Trying 192.168.10.180...
telnet: connect to address 192.168.10.180: Connection timed out
Tue June 7 21:19:43 PDT 2009
Linux%


Here the MAC address 00:00:1a:1b:1c:1d was chosen simply as a MAC
address not being used on the LAN; it is of no special consequence. The timeout
occurs about 3.2 minutes after the initial command. Because there is no host to
respond, all of the segments generated are from the client. Listing 13-1 shows the
output using Wireshark in packet summary (text) mode.


Listing 13-1 Wireshark output for connection establishment that times out

No.  Time       Source          Destination     Protocol  Info
1    0.000000   192.168.10.144  192.168.10.180  TCP       32787 > http
2    2.997928   192.168.10.144  192.168.10.180  TCP       32787 > http
3    8.997962   192.168.10.144  192.168.10.180  TCP       32787 > http
4    20.997942  192.168.10.144  192.168.10.180  TCP       32787 > http
5    44.997936  192.168.10.144  192.168.10.180  TCP       32787 > http
6    92.997937  192.168.10.144  192.168.10.180  TCP       32787 > http


The interesting point in this output is how frequently the client's TCP sends a
SYN to try to establish the connection. The second segment is sent 3s after the first,
the third is sent 6s after the second, the fourth is sent 12s after the third, and so
on. This behavior is called exponential backoff, and we saw something like it before
when we discussed the behavior of Ethernet's CSMA/CD media access control
protocol (see Chapter 3). That case was a little different, however, because
here each backoff is deterministically (i.e., always) twice the previous backoff,
whereas in Ethernet, the maximum backoff is doubled and the actual backoff is
chosen randomly.

The number of times to retry an initial SYN can be configured on some systems
and usually has a fairly small value such as 5. In Linux, the system configuration
variable net.ipv4.tcp_syn_retries gives the maximum number of times
to attempt to resend a SYN segment during an active open. A corresponding value
called net.ipv4.tcp_synack_retries gives the maximum number of times
to attempt to resend a SYN + ACK segment when responding to a peer's active
open request. The SYN retry count can also be set on an individual connection
basis using the Linux-specific TCP_SYNCNT socket option. Its default value is five
retries, as we see here. The exponential backoff timing between these retransmissions
is part of TCP's congestion management response. We shall examine it in detail when we
discuss Karn's algorithm (see Chapter 16).
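The timing in Listing 13-1 can be reproduced with a small sketch. The function name syn_schedule and its parameters are illustrative only. It assumes an initial timeout of 3s, five retries, and that the stack waits one more doubled interval after the last SYN before giving up, which is consistent with the interval between the two date outputs shown earlier.

```python
def syn_schedule(initial_timeout=3.0, retries=5):
    """Return (send_times, give_up_time) for SYNs under exponential backoff.

    send_times[0] is the original SYN at t = 0; each retry is sent after a
    timeout that is exactly double the previous one.
    """
    send_times = [0.0]
    timeout = initial_timeout
    for _ in range(retries):
        send_times.append(send_times[-1] + timeout)
        timeout *= 2  # deterministic doubling, unlike Ethernet's random backoff
    give_up = send_times[-1] + timeout  # wait out the final interval
    return send_times, give_up


times, give_up = syn_schedule()
print(times)          # [0.0, 3.0, 9.0, 21.0, 45.0, 93.0]
print(give_up)        # 189.0
print(give_up / 60)   # roughly 3.2 minutes
```

The send times match the Time column of Listing 13-1 to within a few milliseconds.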

13.2.6 Connections and Translators 

In Chapter 7 we discussed how conventional NAT translates the addresses and
port numbers used by protocols such as TCP and UDP. We also examined how IP
packets can be translated between IPv6 and IPv4. When NAT is used with TCP,
the pseudo-header checksum usually requires adjustment (except in cases where
a checksum-neutral address modifier is used). This is also true for other protocols
that use pseudo-header checksums, because the computation involves information
at the transport layer as well as the network layer.
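One common way to make such an adjustment without recomputing the whole checksum is the incremental update described in RFC 1624, which is not cited in the text above and is shown here only as an illustration. The sketch below works on lists of 16-bit words rather than real packets, and the function names are hypothetical.

```python
def fold(total):
    """Fold carries back in to get a 16-bit ones' complement sum."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def checksum(words):
    """Internet checksum over a sequence of 16-bit words."""
    return ~fold(sum(words)) & 0xFFFF


def update_checksum(old_cksum, old_word, new_word):
    """Adjust a checksum after one 16-bit word changes.

    Follows RFC 1624, equation 3: HC' = ~(~HC + ~m + m').
    """
    total = (~old_cksum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    return ~fold(total) & 0xFFFF


# Toy example: a translator rewrites one 16-bit word of an address.
before = [0x0A00, 0x0001, 0x1234]
after = [0x0A00, 0xC0A8, 0x1234]

old = checksum(before)
new = update_checksum(old, 0x0001, 0xC0A8)
print(hex(old), hex(new), new == checksum(after))
```

A translator that changes a 32-bit IPv4 address would apply the update once per 16-bit half, and again for the port number if that is rewritten as well.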

When a TCP connection is first established, a NAT (or NAT64) can ascertain
this fact because of the presence of the SYN bit field in a segment. It can also
determine when a connection has become fully established by looking for subsequent
SYN + ACK and ACK segments containing the appropriate sequence numbers.
The same applies for the termination of a connection. By implementing a portion
of the TCP state machine in a NAT (see, for example, Sections 3.5.2.1 and 3.5.2.2 of
[RFC6146]), the connection can be tracked, including the current states, sequence
numbers in each direction, and corresponding ACK numbers. Such state tracking
is typical for NAT implementations.

Further complications arise when a NAT acts as an editor and rewrites contents
in the transport protocol's data payload. For TCP, this may involve removing
or adding bytes to the data stream, and consequently affecting the sequence
numbers (and segment lengths). Doing so necessarily affects the checksum, and it
also affects the data sequence. If data is inserted or removed from the data stream
by the NAT, these values can be adjusted appropriately. Doing so is somewhat
fragile because if the NAT state becomes desynchronized with the state in the end
hosts, the connection will not operate properly.


13.3 TCP Options 

The TCP header can contain options (see Figure 12-3). The only options defined in the
original TCP specification are the End of Option List (EOL), the No Operation (NOP),
and the Maximum Segment Size (MSS) options. Since then, several options have been
defined. The entire list is maintained by the IANA [TPARAMS]; Table 13-1 gives the
current options of interest (i.e., those with standards-track RFC descriptions).

Table 13-1 The TCP option values. Up to 40 bytes are available to hold options.

Kind  Length  Name            Reference  Description and Purpose
0     1       EOL             [RFC0793]  End of Option List
1     1       NOP             [RFC0793]  No Operation (used for padding)
2     4       MSS             [RFC0793]  Maximum Segment Size
3     3       WSOPT           [RFC1323]  Window Scaling Factor (left-shift amount on window)
4     2       SACK-Permitted  [RFC2018]  Sender supports SACK options
5     Var.    SACK            [RFC2018]  SACK block (out-of-order data received)
8     10      TSOPT           [RFC1323]  Timestamps option
28    4       UTO             [RFC5482]  User Timeout (abort after idle time)
29    Var.    TCP-AO          [RFC5925]  Authentication option (using various algorithms)
253   Var.    Experimental    [RFC4727]  Reserved for experimental use
254   Var.    Experimental    [RFC4727]  Reserved for experimental use

Every option begins with a 1-byte kind that specifies the type of option. Options 
that are not understood are simply ignored, according to [RFC1122]. The options 
with a kind value of 0 and 1 occupy a single byte. The other options have a len byte 
that follows the kind byte. The length is the total length, including the kind and 
len bytes. The reason for the NOP option is to allow the sender to pad fields to a 
multiple of 4 bytes, if it needs to. Remember that the TCP header's length is always 
required to be a multiple of 32 bits because the TCP Header Length field uses that 
unit. The EOL option indicates the end of the list and that no further processing of 
the options list is to be performed. Now we will have a look at the other options. 
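The kind/len layout just described can be walked with a short parser. This is a minimal sketch, not the code of any particular TCP implementation; the function name parse_tcp_options is hypothetical, and the sample bytes are built by hand from the option values Wireshark displays in Figure 13-6.

```python
import struct

EOL, NOP = 0, 1  # the two single-byte options


def parse_tcp_options(data):
    """Return a list of (kind, value_bytes) pairs from a TCP options area."""
    options = []
    i = 0
    while i < len(data):
        kind = data[i]
        if kind == EOL:          # end of list: stop processing
            options.append((kind, b""))
            break
        if kind == NOP:          # one byte of padding
            options.append((kind, b""))
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("option truncated before its len byte")
        length = data[i + 1]     # total length, including kind and len bytes
        if length < 2 or i + length > len(data):
            raise ValueError("invalid option length")
        options.append((kind, bytes(data[i + 2:i + length])))
        i += length
    return options


# MSS 1460, NOP, WSOPT 2, NOP, NOP, TSOPT, NOP, NOP, SACK-Permitted
raw = (
    struct.pack("!BBH", 2, 4, 1460)
    + bytes([1])
    + struct.pack("!BBB", 3, 3, 2)
    + bytes([1, 1])
    + struct.pack("!BBII", 8, 10, 349742014, 81813090)
    + bytes([1, 1])
    + struct.pack("!BB", 4, 2)
)

print(len(raw))  # 24 bytes of options
for kind, value in parse_tcp_options(raw):
    print(kind, value.hex())
```

The NOP bytes bring the options area to 24 bytes, a multiple of 4, so the 20-byte base header plus options gives a 44-byte TCP header.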

13.3.1 Maximum Segment Size (MSS) Option 

The maximum segment size (MSS) is the largest segment that a TCP is willing to 
receive from its peer and, consequently, the largest size its peer should ever use 
when sending. The MSS value counts only TCP data bytes and does not include 
the sizes of any associated TCP or IP header [RFC0879]. When a connection is 
established, each end usually announces its MSS in an MSS option carried with its 
SYN segment. The option allows for 16 bits to be used to specify the MSS value. If 
no MSS option is provided, a default value of 536 bytes is used. Recall the rule that 
requires any host to be capable of processing IPv4 datagrams at least as large as 
576. With minimum-size IPv4 and TCP headers, a TCP using a sending MSS size 
of 536 bytes produces an IPv4 datagram of size 20 + 20 + 536 = 576 bytes. 

The MSS values in Figure 13-5 are all 1460, which is typical for IPv4. The 
resulting IPv4 datagram is normally 40 bytes larger (1500 bytes total, the typical
MTU size for Ethernet and path MTU for the Internet): 20 bytes for the TCP header
and 20 bytes for the IPv4 header. When IPv6 is used, the MSS is usually 1440, 20 
bytes less because of the larger IPv6 header. The special MSS value of 65535 can be 
used with IPv6 jumbograms to indicate an effective MSS of infinity [RFC2675]. In 
this case the SMSS will be determined as the PMTU minus 60 bytes (40 bytes for 
the IPv6 header and 20 bytes for the TCP header). Note that the MSS option is not 
a negotiation between one TCP and its peer; it is a limit. When one TCP gives its 
MSS option to the other, it is indicating its unwillingness to accept any segments 
larger than that size for the duration of the connection. 
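The MSS arithmetic above reduces to subtracting the minimum header sizes from the MTU. The following sketch assumes headers without options or extension headers; the helper name mss_for_mtu is illustrative.

```python
IPV4_HEADER = 20  # minimum IPv4 header, bytes
IPV6_HEADER = 40  # fixed IPv6 header, bytes
TCP_HEADER = 20   # minimum TCP header, bytes


def mss_for_mtu(mtu, ipv6=False):
    """MSS counts only TCP data bytes, so subtract both headers from the MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


print(mss_for_mtu(1500))             # 1460, typical for IPv4 over Ethernet
print(mss_for_mtu(1500, ipv6=True))  # 1440, 20 bytes less for IPv6
print(mss_for_mtu(576))              # 536, the default when no MSS option is sent
```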

13.3.2 Selective Acknowledgment (SACK) Options 

In Chapter 12 we introduced the concept of a sliding window, and we described 
how TCP handles its sequence numbers and acknowledgments. Because it uses 
cumulative ACKs, TCP is never able to acknowledge data it has received correctly 
but that is not contiguous, in terms of sequence numbers, with data it has received 
previously. In such cases, the TCP receiver is said to have holes in its received data 
queue. A receiving TCP prevents applications from consuming data beyond a hole 
because of the byte stream abstraction it provides. 

If a TCP sender were able to learn of the existence of holes (and out-of- 
sequence data blocks beyond holes in the sequence space) at the receiver, it could 
better select which particular TCP segments to retransmit when segments are lost 
or otherwise missing at the receiver. The TCP selective acknowledgment (SACK) 
options [RFC2018][RFC2883] provide this capability. The scheme works effectively,
however, only if the TCP sender logic is able to make effective use of the
SACK information it receives from a SACK-capable receiver. 

A TCP learns that its peer is capable of advertising SACK information by 
receiving the SACK-Permitted option in a SYN (or SYN + ACK) segment. Once 
this has taken place, the TCP receiving out-of-sequence data may provide a SACK 
option that describes the out-of-sequence data to help its peer perform retransmissions
more efficiently. SACK information contained in a SACK option consists of a
range of sequence numbers representing data blocks the receiver has successfully 
received. Each range is called a SACK block and is represented by a pair of 32-bit 
sequence numbers. Thus, a SACK option containing n SACK blocks is (8n + 2) 
bytes long. Two bytes are used to hold the kind and length of the SACK option. 
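The (8n + 2) length formula, combined with the 40-byte limit on options noted in Table 13-1, bounds how many SACK blocks fit in one segment. The sketch below is only an illustration of that arithmetic; the function names are hypothetical, and the 12-byte figure assumes a Timestamps option (10 bytes) padded with two NOPs.

```python
MAX_OPTION_BYTES = 40  # space available for all TCP options


def sack_option_length(blocks):
    """Each block is two 32-bit sequence numbers; add kind and length bytes."""
    return 8 * blocks + 2


def max_sack_blocks(other_option_bytes=0):
    """Largest n with 8n + 2 bytes fitting beside the other options."""
    return (MAX_OPTION_BYTES - other_option_bytes - 2) // 8


print(sack_option_length(3))   # 26 bytes
print(max_sack_blocks())       # 4 blocks when no other options are present
print(max_sack_blocks(12))     # 3 blocks alongside a padded Timestamps option
```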

Because of the limited amount of space available in the option space of a TCP
header, the maximum number of SACK blocks available to be sent in a single segment
is three (assuming the Timestamps option is also used, described in Section
13.3.4, which is typical for modern TCP implementations). Although the SACK-
Permitted option is only ever sent in a SYN segment, the SACK blocks themselves
may be sent in any segment once the sender has sent the SACK-Permitted option.
Because the operation of SACK is most easily (and importantly) related to the error
and congestion control operations of TCP, we discuss it in further detail when we
cover these topics in Chapters 14 and 16.

13.3.3 Window Scale (WSCALE or WSOPT) Option

The Window Scale option (denoted WSCALE or WSOPT) [RFC1323] effectively
increases the capacity of the TCP Window Advertisement field from 16 to about 30
bits. Instead of changing the field size, however, the header still holds a 16-bit
value, and an option is defined that applies a scaling factor to the 16-bit value. This
factor effectively left-shifts the window field value by the scale factor. This, in
effect, multiplies the window value by the value 2^s, where s is the scale factor.
The 1-byte shift count is between 0 and 14 (inclusive). A shift count of 0 indicates
no scaling. The maximum scale value of 14 provides for a maximum window of
1,073,725,440 bytes (65,535 x 2^14), close to 1,073,741,823 (2^30 - 1), effectively 1GB. TCP
then maintains the "real" window size internally as a 32-bit value.

This option can appear only in a SYN segment, so the scale factor is fixed
in each direction when the connection is established. To enable window scaling,
both ends must send the option in their SYN segments. The end doing the active
open sends the option in its SYN, but the end doing the passive open can send the
option only if the received SYN specifies the option. The scale factor can be different
in each direction. If the end doing the active open sends a nonzero scale factor
but does not receive a Window Scale option from the other end, it sets its send
and receive scale values to 0. This lets systems that do not understand the option
interoperate with systems that do.

Assume we are using the Window Scale option, with a shift count of S for
sending and a shift count of R for receiving. Then every 16-bit advertised window
that we receive from the other end is left-shifted by R bits to obtain the real
advertised window size. Every time we send a window advertisement to the other end,
we take our real 32-bit window size and right-shift it S bits, placing the resulting
16-bit value in the TCP header.
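The two shifts can be written out directly. This is a sketch of the arithmetic only, with hypothetical function names; a real implementation also has to handle details such as rounding and the unscaled window carried in SYN segments.

```python
MAX_SHIFT = 14       # largest shift count allowed by the option
FIELD_MAX = 0xFFFF   # the header field itself stays 16 bits wide


def window_field_to_send(real_window, send_shift):
    """Right-shift our real window by S bits to get the 16-bit header value."""
    if not 0 <= send_shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return min(real_window >> send_shift, FIELD_MAX)


def real_window_received(field, recv_shift):
    """Left-shift a received 16-bit window by R bits to get the real window."""
    if not 0 <= recv_shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return field << recv_shift


print(window_field_to_send(1_000_000, 4))   # 62500 goes in the header
print(real_window_received(62500, 4))       # peer recovers 1000000
print(real_window_received(0xFFFF, 14))     # 1073725440, the maximum window
```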

The shift count is automatically chosen by TCP, based on the size of the
receive buffer. The size of this buffer is set by the system, but the capability is
normally provided for the application to change it. The Window Scale option is
most relevant when TCP is used to provide bulk data transfer over networks with
large bandwidth-delay products (i.e., those with a product of round-trip time and
bandwidth being relatively large). Thus, we shall discuss the importance and use
of this option more in Chapter 16.

13.3.4 Timestamps Option and Protection against Wrapped Sequence Numbers 
(PAWS) 

The Timestamps option (sometimes called the Timestamp option and written as
TSOPT or TSopt) lets the sender place two 4-byte timestamp values in every segment.
The receiver reflects these values in the acknowledgment, allowing the
sender to calculate an estimate of the connection's RTT for each ACK received.
(We must say "each ACK received" and not "each segment" because TCP often
acknowledges multiple segments per ACK; we will see this in Chapter 15.) When
using the Timestamps option, the sender places a 32-bit value in the Timestamp
Value field (called TSV or TSval) in the first part of the TSOPT, and the receiver
echoes this back unchanged in the second Timestamp Echo Reply field (called TSER
or TSecr). TCP headers containing this option increase by 10 bytes (8 bytes for the
two timestamp values and 2 to indicate the option value and length).

The timestamp is a monotonically increasing value. Because the receiver 
simply echoes what it receives, it does not care what the timestamp units or values
actually are. This option does not require any form of clock synchronization
between the two hosts. [RFC1323] recommends that the sender increment the 
timestamp value by at least 1 every second. Figure 13-6 shows the Timestamps 
option, as displayed by Wireshark. 


[Wireshark screenshot. Packet list: segment 1, 10.0.0.9 to 10.0.0.8, 1056 > 6666 [SYN]; segment 2 (highlighted), 10.0.0.8 to 10.0.0.9, 6666 > 1056; segment 3, 10.0.0.9 to 10.0.0.8, 1056 > 6666 [ACK]. Details for frame 2 (78 bytes on wire): flags 0x12 (SYN, ACK), relative sequence number 0, relative ACK number 1, header length 44 bytes, window size 65535, options (24 bytes): Maximum segment size 1460 bytes; NOP; Window scale 2 (multiply by 4); NOP; NOP; Timestamps TSval 349742014, TSecr 81813090; NOP; NOP; SACK permitted.]

Figure 13-6 A TCP connection with the Timestamps, Window Scaling, and MSS options being used.
The TCP header is 44 bytes long. The initial SYN (packet 1) starts with the TSV set to
81813090. The second packet, highlighted, echoes this value back to the active opener
and includes its own value of 349742014.


Here, both ends participate by generating and echoing back the other's 
timestamps. The first segment (client's SYN) uses an initial timestamp value of 
81813090. This value is placed in the TSV. The second portion, TSER, has a value 
of 0 on the first segment because the client does not know the server's timestamp 
value yet.

The main reason for wishing to calculate a good estimate of the connection's
RTT is to set the retransmission timeout, which tells TCP when it should try 
resending a segment that is likely lost. In Chapter 12 we discussed the need to set 
this timeout based on some function of the RTT. With the Timestamps option, we 
can get relatively fine-grain measurements of the RTT. Prior to the creation of the 
Timestamps option, most TCPs would perform just one RTT sample per window 
of data. With the Timestamps option, more samples can be taken, leading to the 
potential of a better RTT estimate (see [RFC1323] and [RFC6298]). 

Because the Timestamps option is most relevant to the setting of the retransmission
timer, we discuss its use for that purpose in more detail when we discuss
retransmission in Chapter 14. We say "for that purpose" because although
the Timestamps option allows for more frequent RTT samples, it also provides 
a way for the receiver to avoid receiving old segments and considering them as 
valid. This is called Protection Against Wrapped Sequence Numbers (PAWS), and it is 
described in [RFC1323] along with the Timestamps option. We'll now take a look 
at how it works. 

Consider a TCP connection using the Window Scale option with the largest
possible window, about 1GB (2^30). Also assume that the Timestamps option is
being used and that the timestamp value assigned by the sender increments by 1
for each window that is sent. (This is conservative. Normally the timestamp
increments faster than this.) Table 13-2 shows the possible data flow between the two
hosts when transferring 6GB. To avoid lots of ten-digit numbers, we use the notation
G to mean a multiple of 1,073,741,824. We also use the notation from tcpdump
that J:K means byte J through and including byte K-1.


Table 13-2 The TCP Timestamps option can disambiguate segments with the same sequence numbers
by providing an extra 32 bits of effective sequence number space.

Time  Bytes Sent  Send Seq. No.  Send Timestamp  Receive
A     0G:1G       0G:1G          1               OK
B     1G:2G       1G:2G          2               OK, but one segment lost and retransmitted
C     2G:3G       2G:3G          3               OK
D     3G:4G       3G:4G          4               OK
E     4G:5G       0G:1G          5               OK
F     5G:6G       1G:2G          6               OK, but retransmitted segment reappears


The 32-bit Sequence Number field wraps between times D and E. We assume
that one segment gets lost at time B and is retransmitted. We also assume that this
lost segment reappears at time F. This assumes that the time difference between
the segment getting lost and reappearing is less than the maximum time a segment
can live in the network (called the MSL; see Section 13.5.2); otherwise the
segment would have been discarded by some router when its TTL expired. As we
mentioned earlier, it is only with relatively high-speed connections that this problem
appears, where old segments can reappear and contain sequence numbers
currently being transmitted.

We can also see from Table 13-2 that using the Timestamps option prevents
this problem. The receiver considers the timestamp as a 32-bit extension of the
sequence number. Because the lost segment that reappears at time F has a timestamp
of 2, which is less than the most recent valid timestamp (5 or 6), it is discarded
by the PAWS algorithm. The PAWS algorithm does not require any form of
time synchronization between the sender and the receiver. All the receiver needs
is for the timestamp values to be monotonically increasing, and to increase by at
least 1 per window of data.
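The core of the PAWS test is a comparison between a segment's timestamp and the most recent valid timestamp the receiver has recorded. The sketch below is a simplification under stated assumptions: the names ts_older and paws_accept are hypothetical, the comparison is done modulo 2^32 so that the timestamp itself may wrap, and the rules for when a real TCP updates its recorded timestamp are reduced to "update on every accepted segment."

```python
TS_MASK = 0xFFFFFFFF  # timestamps are 32-bit values


def ts_older(a, b):
    """True if timestamp a is older than b in 32-bit modular arithmetic."""
    return ((a - b) & TS_MASK) > 0x7FFFFFFF


def paws_accept(segment_tsval, ts_recent):
    """Accept a segment unless its timestamp is older than the latest valid one."""
    return not ts_older(segment_tsval, ts_recent)


# Replay the timestamps of Table 13-2 (times A through F).
ts_recent = 0
for tsval in (1, 2, 3, 4, 5, 6):
    if paws_accept(tsval, ts_recent):
        ts_recent = tsval

# The segment lost at time B reappears at time F carrying timestamp 2.
print(paws_accept(2, ts_recent))  # False: discarded by PAWS
print(paws_accept(6, ts_recent))  # True: current data is accepted
```

Even though the old segment's sequence numbers (1G:2G) fall inside the range in use at time F, its timestamp identifies it as stale.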

13.3.5 User Timeout (UTO) Option 

The User Timeout (UTO) option is a relatively new TCP capability described in
[RFC5482]. The UTO value (also called USER_TIMEOUT) specifies the amount of
time a TCP sender is willing to wait for an ACK of outstanding data before concluding
that the remote end has failed. USER_TIMEOUT has traditionally been a
local configuration parameter for TCP [RFC0793]. The UTO option allows one TCP
to signal its USER_TIMEOUT value to its connection peer. This allows the receiving
TCP to adjust its behavior (e.g., to tolerate a longer period of disrupted connectivity
prior to aborting a connection). NAT devices could also interpret such
information to help set their connection activity timers.

UTO option values are advisory; just because one end of a connection might
wish to use a large or small UTO value does not mean that the other end needs to
comply. [RFC1122] refines the definition of USER_TIMEOUT and suggests that a
TCP reaching a threshold of three (R1) retransmissions should notify the requesting
application, and that after 100s (R2) the connection should be closed. Some
implementations have an API function to change R1 and R2. Because long UTOs
might lead to resource exhaustion concerns and short UTOs might result in some
connections being torn down early (a type of DoS attack), upper and lower limits
are placed on the possible UTO values. The way to set USER_TIMEOUT, then, is
as follows:

USER_TIMEOUT = min(U_LIMIT, max(ADV_UTO, REMOTE_UTO, L_LIMIT) ) 

where ADV_UTO is the UTO option advertised to the remote TCP, REMOTE_UTO
is the peer's advertised UTO option value, U_LIMIT is the local system's upper
UTO limit, and L_LIMIT is the local system's UTO lower limit. Note that this formula
does not guarantee that each end of the same connection will arrive at the
same USER_TIMEOUT value. In all cases the L_LIMIT value must be greater than
the associated connection's retransmission timeout (RTO) value (see Chapter 14),
and it is recommended to be set to 100s to retain compatibility with [RFC1122].

UTO options are included on SYN segments when a connection is established,
on the first non-SYN segments, and whenever the USER_TIMEOUT value
is changed. The option value is expressed as a 15-bit value in units of seconds
or minutes following a bit field ("granularity") that indicates that the value is in
minutes (1) or seconds (0). As a relatively new option, it is not yet widely deployed.
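The USER_TIMEOUT formula and the option encoding can be sketched together. Everything named here is illustrative: the function names are hypothetical, the 1800s upper limit is an arbitrary placeholder for a local policy value, and the byte layout assumes the granularity bit is the most significant bit of a 16-bit field that follows the kind (28) and length (4) bytes listed in Table 13-1.

```python
import struct

UTO_KIND, UTO_LENGTH = 28, 4  # values from Table 13-1


def user_timeout(adv_uto, remote_uto, l_limit=100, u_limit=1800):
    """USER_TIMEOUT = min(U_LIMIT, max(ADV_UTO, REMOTE_UTO, L_LIMIT)), in seconds."""
    return min(u_limit, max(adv_uto, remote_uto, l_limit))


def encode_uto(value, minutes=False):
    """Pack a UTO option: granularity bit (1 = minutes) then a 15-bit value."""
    if not 0 <= value < (1 << 15):
        raise ValueError("UTO value must fit in 15 bits")
    field = (int(minutes) << 15) | value
    return struct.pack("!BBH", UTO_KIND, UTO_LENGTH, field)


def decode_uto(option):
    """Return the advertised timeout in seconds from a 4-byte UTO option."""
    kind, length, field = struct.unpack("!BBH", option)
    if (kind, length) != (UTO_KIND, UTO_LENGTH):
        raise ValueError("not a UTO option")
    value = field & 0x7FFF
    return value * 60 if field >> 15 else value


print(user_timeout(30, 60))                 # 100: raised to the lower limit
print(user_timeout(300, 7200))              # 1800: capped by the upper limit
print(user_timeout(300, 600, u_limit=400))  # 400: a peer with a larger limit would differ
print(encode_uto(5, minutes=True).hex())    # 1c048005
print(decode_uto(encode_uto(5, minutes=True)))  # 300 seconds
```

The third call shows why the two ends need not agree: each applies its own limits to the same pair of advertised values.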

13.3.6 Authentication Option (TCP-AO) 

There is an option used to enhance the security of TCP connections. It is designed 
to enhance and replace an earlier mechanism called TCP-MD5 [RFC2385]. Called
the TCP Authentication Option (TCP-AO) [RFC5925], it uses a cryptographic hash
algorithm (see Chapter 18), in combination with a secret value known to each 
end of a TCP connection, to authenticate each segment. TCP-AO improves upon 
TCP-MD5 by supporting a variety of cryptographic algorithms and identifying 
changing of keys using in-band signaling. It does not provide a comprehensive 
key management solution, however. That is, each end still has to have a way to 
establish a shared set of keys prior to operation. 

When sending, the TCP derives a traffic key from the shared secret key 
and computes the hash value according to a particular cryptographic algorithm 
[RFC5926]. A receiver, equipped with the same secret key, is likewise able to derive
the traffic key and use it to ensure that an arriving segment has not been modified
in transit (with high probability). This option is intended as a strong countermeasure
to a variety of TCP spoofing attacks (see Section 13.8). However, because it
requires creation and distribution of a shared key (and is a relatively new option), 
it is not yet widely deployed. 


13.4 Path MTU Discovery with TCP 

In Chapter 3, we described the concept of the path MTU. It is the minimum MTU 
on any network segment that is currently in the path between two hosts. Knowing 
the path MTU can help protocols such as TCP avoid fragmentation. In Chapter 10, 
we looked at how discovery of the path MTU (PMTUD) is accomplished based on 
ICMP messages, but in that case UDP is not usually able to adapt its datagram size 
because the application specifies the size (i.e., not the transport protocol). TCP, in 
providing the byte stream abstraction it implements, determines what segment 
size to use and as a result has a much greater degree of control over the size of IP 
datagrams that are ultimately generated. 

In this section we will examine how PMTUD is used by TCP. Our discussion
will apply to both TCP/IPv4 and TCP/IPv6. More details are provided by
[RFC1191] and [RFC1981], respectively. A method that avoids the use of ICMP,
called Packetization Layer Path MTU Discovery (PLPMTUD), can also be used by
TCP [RFC4821] or by other transport protocols. We shall use the ICMPv6 Packet
Too Big (PTB) terminology to refer to either ICMPv4 Destination Unreachable
(Fragmentation Required) or ICMPv6 Packet Too Big messages.

TCP's regular PMTUD process operates as follows: When a connection is
established, TCP uses the minimum of the MTU of the outgoing interface, or the
MSS announced by the other end, as the basis for selecting its send maximum
segment size (SMSS). PMTUD does not allow TCP to exceed the MSS announced
by the other end. If the other end does not specify an MSS, the sender assumes a
default of 536 bytes, but this situation is now rare. It is also possible for an
implementation to save path MTU information on a per-destination basis to help in
selecting its segment size. Note that the path MTU in each direction of a connection
could be different.

Once the initial SMSS is chosen, all IPv4 datagrams sent by TCP on that connection
have the IPv4 DF bit field set. For TCP/IPv6, this is not necessary because
there is no DF bit field; all datagrams are assumed to have it set implicitly. If a PTB
is received, TCP decreases the segment size and retransmits using a different segment
size. If the PTB contains the suggested next-hop MTU, the segment size can
be set to the next-hop MTU minus the sizes of the IPv4 (or IPv6) and TCP headers.
If the next-hop MTU value is not present (e.g., an older ICMP error was returned
that lacks this information), the sender may try a variety of values (e.g., binary
search for a usable value). This also affects TCP's congestion control management
(see Chapter 16). For PLPMTUD the situation is similar, except PTB messages are
not used. Instead, the protocol performing PMTUD must be able to detect message
discards quickly and perform its own datagram size adjustments.
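The adjustment made on receipt of a PTB that carries a next-hop MTU is a simple subtraction. The sketch below uses a hypothetical helper name; the 32-byte TCP header in the last two calls is an assumption inferred from the lengths in Listing 13-2 of Section 13.4.1, where a 588-byte datagram carries 536 data bytes and the 288-byte retransmission carries 236.

```python
def segment_size_after_ptb(next_hop_mtu, ip_header=20, tcp_header=20):
    """New segment size: next-hop MTU minus the IP and TCP header sizes."""
    size = next_hop_mtu - ip_header - tcp_header
    if size <= 0:
        raise ValueError("next-hop MTU too small to carry any TCP data")
    return size


print(segment_size_after_ptb(1492))                 # 1452 over a PPPoE link, minimum headers
print(segment_size_after_ptb(288))                  # 248 with minimum headers
print(segment_size_after_ptb(288, tcp_header=32))   # 236 when the TCP header carries 12 option bytes
print(segment_size_after_ptb(588, tcp_header=32))   # 536, the size that triggered the PTB
```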

Because routes can change dynamically, when some time has passed since 
the last decrease of the segment size, a larger value (up to the initial SMSS) can be 
tried. Guidance in [RFC1191] and [RFC1981] recommends that this time interval 
be about 10 minutes. 

There are a number of problems with PMTUD when it operates in an Internet
environment with firewalls that block PTB messages [RFC2923]. Of the various
operational problems with PMTUD, black holes have been the most problematic,
although the situation is improving (in [LS10], 80% of systems studied were able
to properly process PTB messages). PMTUD black holes arise when a TCP implementation
that depends on the delivery of ICMP messages to adjust its segment
size never receives them. This could be for several reasons, including a firewall or
NAT configuration that prohibits such ICMP messages from being forwarded. The
consequence is a TCP connection that cannot proceed once it starts to use larger
packets. It can be difficult to diagnose because only large packets cannot be forwarded.
The smaller ones (such as SYN and SYN + ACK packets used to establish
the connection) generally succeed. Some TCP implementations have "black hole
detection," which amounts to trying a smaller segment size when a segment is
retransmitted several times.

13.4.1 Example 

We can see the correct behavior of PMTUD when an intermediate router has an
MTU less than either of the endpoints' MSS. To create this situation, we begin with
a router (a Linux host with local address 10.0.0.1) that has a PPPoE interface to a
DSL service provider. The PPPoE link uses an MTU of 1492 (1500 bytes for Ethernet,
minus 6 bytes of PPPoE overhead, minus another 2 bytes of PPP overhead; see
Chapter 3). Figure 13-7 is an illustration of the topology.



Figure 13-7 The PPPoE encapsulation drops the path MTU of most TCP connections to 1492 bytes 
from what might otherwise have been 1500 bytes (the typical MTU for Ethernet). To 
demonstrate TCP's use of PMTUD, we set the MTU even smaller (288 bytes). 


In order to induce this behavior specifically, we can reduce the MTU size on 
the PPPoE link from 1492 to, say, 288 bytes. On the GW machine, the following 
command accomplishes this task: 


Linux(GW)# ifconfig ppp0 mtu 288


In addition, we need to tell the client system (C) that small segments are allowed: 


Linux(C)# sysctl -w net.ipv4.route.min_pmtu=68


If we did not perform this second operation, Linux would clamp its minimum
path MTU at the default value of 552 bytes, which helps avoid certain small MTU
attacks (see Section 13.8). The consequence of doing so in our example here is that
any packets larger than 288 bytes would be fragmented. To avoid this, and to demonstrate
PMTUD more effectively, we remove this minimum. We then start a file
transfer from machine C (address 10.0.0.123) to the server S on the Internet (address
169.229.62.97). Listing 13-2 shows a tcpdump packet trace from this exchange. Several
lines have been wrapped and extraneous fields have been removed for clarity.


Listing 13-2 The path MTU discovery mechanism finds an appropriate segment size to use when 
transiting the network where the middle link has a smaller MTU than the endpoints. 

1  20:20:21.992721 IP (tos 0x0, ttl 45, id 43565, offset 0, flags [DF], 
       proto 6, length: 588) 
       169.229.62.97.22 > 10.0.0.123.1027: P [tcp sum ok] 
       41:577(536) ack 23 

2  20:20:21.993727 IP (tos 0x0, ttl 64, id 57659, offset 0, flags [DF], 
       proto 6, length: 588) 
       10.0.0.123.1027 > 169.229.62.97.22: P [tcp sum ok] 
       23:559(536) ack 577 

3  20:20:21.994093 IP (tos 0xc0, ttl 64, id 57547, offset 0, flags 
       [none], proto 1, length: 576) 
       10.0.0.1 > 10.0.0.123: icmp 556: 
       169.229.62.97 unreachable - need to frag (mtu 288) for 
       IP (tos 0x0, ttl 63, id 57659, offset 0, flags [DF], 
       proto 6, length: 588) 
       10.0.0.123.1027 > 169.229.62.97.22: 
       P 23:559(536) ack 577 

4  20:20:21.994884 IP (tos 0x0, ttl 64, id 57660, offset 0, flags [DF], 
       proto 6, length: 288) 
       10.0.0.123.1027 > 169.229.62.97.22: . [tcp sum ok] 
       23:259(236) ack 577 

5  20:20:22.488856 IP (tos 0x0, ttl 45, id 6712, offset 0, flags [DF], 
       proto 6, length: 836) 
       169.229.62.97.22 > 10.0.0.123.1027: P [tcp sum ok] 
       857:1641(784) ack 855 

6  20:20:29.672947 IP (tos 0x8, ttl 64, id 57679, offset 0, flags [DF], 
       proto 6, length: 1452) 
       10.0.0.123.1027 > 169.229.62.97.22: . [tcp sum ok] 
       1431:2831(1400) ack 2105 

7  20:20:29.674123 IP (tos 0xc8, ttl 64, id 57548, offset 0, flags 
       [none], proto 1, length: 576) 
       10.0.0.1 > 10.0.0.123: icmp 556: 
       169.229.62.97 unreachable - need to frag (mtu 288) for 
       IP (tos 0x8, ttl 63, id 57679, offset 0, flags [DF], 
       proto 6, length: 1452) 
       10.0.0.123.1027 > 169.229.62.97.22: . 
       1431:2831(1400) ack 2105 

8  20:20:29.673751 IP (tos 0x8, ttl 64, id 57680, offset 0, flags [DF], 
       proto 6, length: 1452) 
       10.0.0.123.1027 > 169.229.62.97.22: . [tcp sum ok] 
       2831:4231(1400) ack 2105 

9  20:20:29.675180 IP (tos 0xc8, ttl 64, id 57549, offset 0, flags 
       [none], proto 1, length: 576) 
       10.0.0.1 > 10.0.0.123: icmp 556: 
       169.229.62.97 unreachable - need to frag (mtu 288) for 
       IP (tos 0x8, ttl 63, id 57680, offset 0, flags [DF], 
       proto 6, length: 1452) 
       10.0.0.123.1027 > 169.229.62.97.22: . 
       2831:4231(1400) ack 2105 

10 20:20:29.674932 IP (tos 0x8, ttl 64, id 57681, offset 0, flags 
       [DF], proto 6, length: 288) 
       10.0.0.123.1027 > 169.229.62.97.22: . [tcp sum ok] 
       1431:1667(236) ack 2105 

11 20:20:29.675143 IP (tos 0x8, ttl 64, id 57682, offset 0, flags 
       [DF], proto 6, length: 288) 
       10.0.0.123.1027 > 169.229.62.97.22: . [tcp sum ok] 
       1667:1903(236) ack 2105 


In the tcpdump output, the connection has already been set up and MSS 
options have been exchanged. All packets on the connection have the DF bit field 
set, so both ends are performing PMTUD. The remote side's first packet is 588 
bytes long, and it passes through the router successfully in one piece, despite our 
configuration of the MTU on the PPPoE link being 288 bytes. The reason for this is 
asymmetry in the MTU configuration. Although the local end of the PPPoE link 
is using a maximum transmission unit of 288 bytes, the other end is using a larger 
size, presumably 1492 bytes. This leaves us in the situation where our outgoing 
packets need to be small (288 bytes or less), and packets traveling in the 
reverse direction can be larger. 

When the local end attempts to send a larger packet of size 588 bytes with 
the DF bit field turned on, a PTB message is generated by the router (10.0.0.1), 
indicating that the appropriate MTU for the next-hop link is 288 bytes. The TCP 
responds by sending its next packet with size 288 bytes, as instructed. To then 
send the rest of the sequence numbers it attempted to send in its 588-byte packet, 
it sends two additional packets, of sizes 288 and 116. A similar pattern of 
sizes repeats during the course of the file transfer. 
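
The packet sizes in the trace follow from simple arithmetic, which we can sketch in Python. The function name is our own, and the sketch assumes that every packet on the connection carries the same combined IP and TCP header size as the original (52 bytes here, the difference between the 588-byte packet length and its 536 bytes of data).

```python
def split_for_mtu(packet_len, payload_len, path_mtu):
    """Return the IP packet sizes needed to carry payload_len bytes of TCP
    data when no packet may exceed path_mtu.

    Assumption: each packet uses the same IP + TCP header size (including
    options) as the original packet.
    """
    header_len = packet_len - payload_len      # IP + TCP headers
    max_payload = path_mtu - header_len        # data bytes that fit per packet
    if max_payload <= 0:
        raise ValueError("path MTU too small to carry any data")
    sizes = []
    remaining = payload_len
    while remaining > 0:
        chunk = min(remaining, max_payload)
        sizes.append(header_len + chunk)
        remaining -= chunk
    return sizes


if __name__ == "__main__":
    # The 588-byte packet (536 data bytes) from Listing 13-2, path MTU 288
    print(split_for_mtu(588, 536, 288))        # [288, 288, 116]
```

Each 288-byte packet carries 236 bytes of data, which matches the 236-byte segments visible in packets 4, 10, and 11 of the listing.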

The PMTU discovery process is one of the only ways TCP explicitly attempts to 
adapt its segment size after a connection has started, at least when large amounts 
of data are transferred. The size of a segment can affect the overall throughput 
performance, as can the window size. We discuss how these affect overall 
performance in Chapter 15. 


13.5 TCP State Transitions 

We have described numerous rules regarding the initiation and termination of 
a TCP connection, and we have seen which types of segments are sent during 
different phases of a connection. The rules that determine what TCP does are 
determined by what state TCP is in. The current state is changed based on various 
stimuli, such as segments that are transmitted or received, timers that expire, 
application reads or writes, or information from other layers. These rules can be 
summarized in TCP's state transition diagram. 

13.5.1 TCP State Transition Diagram 

TCP's state transition diagram is shown in Figure 13-8. States are indicated by 
ovals and transitions between states by arrows. Each endpoint of a connection 
transitions through the states. Some transitions are triggered by the receipt of a 
segment with certain control bit fields set (e.g., SYN, ACK, FIN). Some transitions 
also cause a segment with particular control bit fields set to be sent. Other 
transitions may be triggered by application actions or by timers expiring. Each of these 
cases is indicated in the diagram as a textual annotation near the associated 
transition arrow. When initialized, TCP starts in the CLOSED state. Usually an 
immediate transition takes it to either the SYN_SENT or LISTEN state, depending on 
whether the TCP is asked to perform an active or passive open, respectively. 

Figure 13-8 The TCP state transition diagram (also called finite state machine). Arrows represent 
transitions between states due to segment transmission, segment reception, or timers 
expiring. The bold arrows indicate typical client behavior, and the dashed arrows 
indicate typical server behavior. The boldface directives (e.g., open, close) are actions 
performed by applications. 

Note in this diagram that only a subset of the state transitions is "typical." 
We have marked the normal client transitions with a darker solid arrow, and the 
normal server transitions with a dashed arrow. The two transitions leading to the 
ESTABLISHED state correspond to opening a connection, and the two transitions 
leading from the ESTABLISHED state are for the termination of a connection. The 
ESTABLISHED state is where data transfer can occur between the two ends in 
both directions. Chapters 14-17 describe what happens in this state. 

We have labeled the FIN_WAIT_1, FIN_WAIT_2, and TIME_WAIT states 
as being (at least partially) in a box called "Active Close." These are the set of 
states entered when the local application initiates a close request. Two other states 
(CLOSE_WAIT and LAST_ACK) are collected in a dashed box with the label "Passive 
Close." These states correspond to waiting for a peer to acknowledge a FIN 
segment and perform its close. Simultaneous close, which is a form of double 
active close, uses the CLOSING state. 

The names of the 11 states (CLOSED, LISTEN, SYN_SENT, etc.) in this figure 
are based on the names output by the netstat command in UNIX, Linux, and 
Windows, which are themselves based on the names originally used in [RFC0793]. 
The state CLOSED is not really an "official" state but has been added as a useful 
starting point and ending point for the diagram. 

The state transition from LISTEN to SYN_SENT is legal in the TCP protocol 
but is not supported by Berkeley sockets and is rarely seen. The transition from 
SYN_RCVD back to LISTEN is valid only if the SYN_RCVD state was entered 
from the LISTEN state (the normal scenario), not from the SYN_SENT state (a 
simultaneous open). This means that if we perform a passive open (enter LISTEN), 
receive a SYN, send a SYN with an ACK (enter SYN_RCVD), and then receive a 
reset instead of an ACK, the endpoint returns to the LISTEN state and waits for 
another connection request to arrive. 

Figure 13-9 shows the normal TCP connection establishment and termination, 
detailing the different states through which the client and server pass. It is a simpler 
version of Figure 13-1 showing the relevant states but not the options or ISN details. 
We assume in Figure 13-9 that the client on the left side does an active open and 
the server on the right side does a passive open. Although we show the client 
doing the active close, as we mentioned earlier, either side can do the active close. 
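
The typical client and server paths through the state machine can be sketched as a lookup table. This is a deliberately partial illustration: the event labels (such as "app:active_open" or "recv:SYN+ACK") are names we invented for the example, only the normal transitions plus the reset case described above are included, and the simultaneous open and close transitions are omitted.

```python
# Partial sketch of the TCP state machine: (state, event) -> next state.
# The event labels are hypothetical names chosen for this example.
TRANSITIONS = {
    ("CLOSED", "app:active_open"): "SYN_SENT",      # sends SYN
    ("CLOSED", "app:passive_open"): "LISTEN",
    ("LISTEN", "recv:SYN"): "SYN_RCVD",             # sends SYN+ACK
    ("SYN_RCVD", "recv:RST"): "LISTEN",             # valid only if entered from LISTEN
    ("SYN_SENT", "recv:SYN+ACK"): "ESTABLISHED",    # sends ACK
    ("SYN_RCVD", "recv:ACK"): "ESTABLISHED",
    ("ESTABLISHED", "app:close"): "FIN_WAIT_1",     # active close, sends FIN
    ("ESTABLISHED", "recv:FIN"): "CLOSE_WAIT",      # passive close, sends ACK
    ("FIN_WAIT_1", "recv:ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv:FIN"): "TIME_WAIT",        # sends final ACK
    ("CLOSE_WAIT", "app:close"): "LAST_ACK",        # sends FIN
    ("LAST_ACK", "recv:ACK"): "CLOSED",
    ("TIME_WAIT", "timer:2MSL"): "CLOSED",
}


def run(events, state="CLOSED"):
    """Apply events in order and return the list of states visited."""
    path = [state]
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition from {state} on {event}")
        state = TRANSITIONS[key]
        path.append(state)
    return path


CLIENT_EVENTS = ["app:active_open", "recv:SYN+ACK", "app:close",
                 "recv:ACK", "recv:FIN", "timer:2MSL"]
SERVER_EVENTS = ["app:passive_open", "recv:SYN", "recv:ACK",
                 "recv:FIN", "app:close", "recv:ACK"]

if __name__ == "__main__":
    print(" -> ".join(run(CLIENT_EVENTS)))
    print(" -> ".join(run(SERVER_EVENTS)))
```

Note that the table cannot express the condition on the SYN_RCVD to LISTEN transition (it does not remember how SYN_RCVD was entered); a fuller model would need to track that history.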

13.5.2 TIME_WAIT (2MSL Wait) State 

The TIME_WAIT state is also called the 2MSL wait state. It is a state in which TCP 
waits for a time equal to twice the Maximum Segment Lifetime (MSL), sometimes 
called timed wait. Every implementation must choose a value for the MSL. It is 
the maximum amount of time any segment can exist in the network before being 
discarded. We know that this time limit is bounded, because TCP segments are 
transmitted as IP datagrams, and the IP datagram has the TTL field or Hop Limit 
field that limits its effective lifetime (see Chapter 5). [RFC0793] specifies the MSL 
as 2 minutes. Common implementation values, however, are 30s, 1 minute, or 2 
minutes. In most cases, the value can be modified. On Linux, the value 
net.ipv4.tcp_fin_timeout holds the 2MSL wait timeout value (in seconds). On 
Windows, the following registry key: 


HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpTimedWaitDelay 


holds the timeout. It is permitted to be in the range of 30 to 300s. For IPv6, replace 
the term Tcpip with Tcpip6. 

Figure 13-9 TCP states corresponding to normal connection establishment and termination 

Given the MSL value for an implementation, the rule is: When TCP performs 
an active close and sends the final ACK, that connection must stay in the TIME_ 
WAIT state for twice the MSL. This lets TCP resend the final ACK in case it is lost. 
The final ACK is resent not because the TCP retransmits ACKs (they do not con¬ 
sume sequence numbers and are not retransmitted by TCP), but because the other 
side will retransmit its FIN (which does consume a sequence number). Indeed, 
TCP will always retransmit FINs until it receives a final ACK. 

Another effect of this 2MSL wait state is that while the TCP implementation 
waits, the endpoints defining that connection (client IP address, client port 
number, server IP address, and server port number) cannot be reused. That connection 
can be reused only when the 2MSL wait is over, or when a new connection uses 
an ISN that exceeds the highest sequence number used on the previous 
instantiation of the connection [RFC1122], or when use of the Timestamps option allows 
segments from a previous connection instantiation to be disambiguated so that they 
are not confused with new ones [RFC6191]. Unfortunately, some implementations impose a 
more stringent constraint. In these systems, a local port number cannot be reused 
while that port number is the local port number of any endpoint that is in the 
2MSL wait state on the system. We will see examples of this constraint in Listings 
13-3 and 13-4. 
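
The three conditions under which a connection's 4-tuple may be reused can be sketched as a small predicate. This is a simplified illustration rather than the algorithm of any real stack: the function and parameter names are our own, sequence numbers and timestamp values are compared using 32-bit modular arithmetic, and the many additional checks a real implementation performs are left out.

```python
def seq_gt(a, b):
    """True if 32-bit value a is 'after' b under modular (wraparound) comparison."""
    return a != b and ((a - b) & 0xFFFFFFFF) < 0x80000000


def may_reuse_four_tuple(seconds_in_time_wait, msl_seconds,
                         new_isn=None, highest_prev_seq=None,
                         new_tsval=None, last_prev_tsval=None):
    """Simplified check of the three reuse conditions described in the text.

    All parameter names are hypothetical; None means "not known".
    """
    # Condition 1: the 2MSL wait is over.
    if seconds_in_time_wait >= 2 * msl_seconds:
        return True
    # Condition 2: the new ISN exceeds the highest sequence number used by
    # the previous instantiation of the connection [RFC1122].
    if new_isn is not None and highest_prev_seq is not None:
        if seq_gt(new_isn, highest_prev_seq):
            return True
    # Condition 3: the Timestamps option lets old and new segments be told
    # apart [RFC6191]; modeled here as a strictly newer timestamp value.
    if new_tsval is not None and last_prev_tsval is not None:
        if seq_gt(new_tsval, last_prev_tsval):
            return True
    return False
```

For example, with an MSL of 60s, a reuse attempt 10s into the wait is refused unless the new ISN (or timestamp) is larger than the one last used by the old connection.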

Most implementations and APIs provide a way to bypass this restriction. With 
the Berkeley sockets API, the SO_REUSEADDR socket option enables the bypass 
operation. It lets the caller assign itself a local port number even if that port 
number is part of some connection in the 2MSL wait state. We will see, however, that 
even with this bypass mechanism for one socket (address, port number pair), the 
rules of TCP still (should) prevent this port number from being reused by another 
instantiation of the same connection that is in the 2MSL wait state. Any delayed 
segments that arrive for a connection while it is in the 2MSL wait state are 
discarded. Because the connection defined by the address/port 4-tuple in the 2MSL 
wait state cannot be reused during this time period, when a valid connection is 
finally established, we know that delayed segments from an earlier instantiation 
of this connection cannot be misinterpreted as being part of the new connection. 
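
As a minimal sketch of how an application requests this bypass, the following Python function sets SO_REUSEADDR before binding a listening socket. The helper name, the loopback default, and the backlog value are assumptions for the example; the socket calls themselves are the standard Berkeley sockets operations exposed by Python's socket module.

```python
import socket


def make_listener(port, host="127.0.0.1"):
    """Create a listening TCP socket with SO_REUSEADDR enabled.

    The option must be set before bind() for it to affect the bind.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(5)
    return sock


if __name__ == "__main__":
    listener = make_listener(6666)
    print("listening on", listener.getsockname())
    listener.close()
```

A server written this way can typically be restarted on its well-known port without waiting for connections left in the 2MSL wait state to expire.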

For interactive applications, it is normally the client that does the active close 
and enters the TIME_WAIT state. The server usually does the passive close and 
does not go through the TIME_WAIT state. The implication is that if we terminate 
a client, and restart the same client immediately, that new client cannot reuse the 
same local port number. This is not ordinarily a problem, because clients normally 
use ephemeral ports assigned by the operating system and do not care what the 
assigned port number is. (Recall, it is actually a recommended practice for them 
to be randomized for security reasons [RFC6056].) This is important to know 
because a client that makes a large number of connections quickly (especially to 
the same server) could conceivably have to delay while other connections 
terminate if ephemeral ports are in short supply. 

With servers, however, the situation is different. They almost always use 
well-known ports. If we terminate a server process that has a connection established 
and immediately try to restart it, the server cannot assign its assigned port 
number to its endpoint (it gets an "Address already in use" binding error), because that 
port number is part of a connection that is in a 2MSL wait state. It may take from 
1 to 4 minutes for the server to be able to restart, depending on the local system's 
value for the MSL. We can see this scenario using our sock program. In Listing 
13-3 we start the server, connect to it from a client, and then terminate the server. 

Listing 13-3 A TCP connection must complete a 2MSL delay in the TIME_WAIT state before a port 
number can be reused by another process. 

Linux% sock -v -s 6666 

(now a client on another computer connects to this server) 

connection on 192.168.10.144.6666 from 192.168.10.140.2623 
(server stopped by typing interrupt character) 

(now server is restarted) 

Linux% sock -v -s 6666 

can't bind local address: Address already in use 
Linux% netstat -n -t 

Active Internet connections (w/o servers) 

Proto Recv-Q Send-Q Local Address Foreign Address State 

tcp 0 0 192.168.10.144:6666 192.168.10.140:2623 TIME_WAIT 

(wait one minute and restart server again) 

Linux% sock -v -s 6666 


When we try to restart the server, the program outputs an error message 
indicating that it cannot bind its port number because the address is already in use. 
This really means that the address and port number combination is already in 
use; it is in a 2MSL wait state because of the previous connection. This is the more 
stringent restriction on port number reuse mentioned before. The output from 
the netstat command shows that the connection is in the TIME_WAIT state. 
Although clients do not typically experience as many issues with 2MSL wait states 
as servers do, we can demonstrate the same issue by having the client specify its 
own port number, as shown in Listing 13-4. 

Listing 13-4 A client cannot reuse a port number while it is still being used by another connection 
in the 2MSL wait state. 


(start server in one window) 

Linux% sock -s -v 6666 

(connect to it from another window) 

Linux% sock -v 127.0.0.1 6666 

(server identifies incoming connection) 

connection on 127.0.0.1.6666 from 127.0.0.1.2091 

(client identifies connection establishment, and is interrupted) 
connected on 127.0.0.1.2091 to 127.0.0.1.6666 

^C 

(server identifies connection has terminated and exits) 

connection closed by peer 

Linux% 

(client is restarted, specifying same port number as before) 

Linux% sock -b 2091 -v 127.0.0.1 6666 

bind() error: Address already in use 

(wait 30 seconds and try again) 

Linux% sock -b 2091 -v 192.168.10.144 6666 

connect() error: Connection refused 


The first time we execute the client we specify the -v option to see what the 
local (ephemeral) port number assigned to the client is (2091). The second time we 
execute the client we specify the -b option, telling the client to assign itself 2091 
as its local port number instead of being given another ephemeral port number by 
the operating system. As we expect, the client cannot do this, because port 2091 
is part of a connection that is in a 2MSL wait state. Once the wait is over (1 minute 
on this Linux machine), the client attempts to connect again, but the server exited 
when the connection was interrupted the first time, so it is refused. We shall see 
how TCP reset segments are used to signal this connection refused condition in 
Section 13.6. 

We mentioned earlier that most systems provide a way of overriding the 
default behavior, which allows processes to bind to ports even if those ports are 
part of connections in the 2MSL wait state. Now we try the same scenario as 
before, but using the -A option to sock, which enables the bypass mechanism: 


Linux% sock -A -v -s 6666 
Linux% sock -A -v -s 6666 


In this example, we start the server with the -A option, which enables the 
SO_REUSEADDR socket option that we mentioned. By doing this, we allow the 
server to bind to its port even though it is part of a connection in the 2MSL wait 
state. If we try to use the client right away with the same port, however, the fol¬ 
lowing happens: 


Linux% sock -b 32840 -v 127.0.0.1 6666 

bind() error: Address already in use 


Once again, the endpoint 127.0.0.1.32840 is in use, so the client fails. If, 
however, we also use the -A option for the client, we can force the connection to work: 

Linux% sock -A -b 32840 -v 127.0.0.1 6666 

Connected on 127.0.0.1.32840 to 127.0.0.1.6666 
TCP_MAXSEG = 16383 

Here we see that even though the same connection (4-tuple) is being used 
again before the 2MSL wait state expires, the use of the -A option has forced the 
connection to be allowed. Of course, this is all taking place on the same computer, 
so the operating system is able to ascertain what processes represent what ends 
of the connections in the 2MSL wait state and (potentially, at least) keep them 
separate. What if we try the same thing again but establish the connection from 
another host? Here we test this idea: 

(start server on first machine) 

Linux% sock -v -s 6666 

(connect to it from second - Windows - machine) 

C:\> sock -A -v 10.0.0.1 6666 

(server identifies incoming connection) 

connection on 10.0.0.1.6666 from 10.0.0.3.2172 

(client identifies connection establishment, and is interrupted) 
connected on 10.0.0.3.2172 to 10.0.0.1.6666 

^C 

C:\> 


(server identifies connection has terminated and exits) 

connection closed by peer 
Linux% 

(client is restarted, specifying same port number as before) 

C:\> sock -A -b 2091 -v 10.0.0.1 6666 

connect() error: Address already in use 
C:\> sock -A -b 2091 -v 10.0.0.1 6666 

connect() error: Address already in use 

(wait 30 seconds and try again) 

C:\> sock -A -b 2091 -v 10.0.0.1 6666 

connect() error: Connection refused 


This example is similar to the previous one, except the client and server are on 
different machines. We observe that irrespective of the -A flag on the client, the 
2MSL wait time is induced. Here the 2MSL wait lasts for 30s. After that, the client 
attempts to contact the server, which has already exited. 

One interesting thing happens if we switch the client and server machines. 
We will now use Windows as the server and Linux as the client and repeat the 
experiment: 


(start server on Windows machine) 

C:\> sock -v -s 6666 

(connect to it from second - Linux - machine) 

Linux% sock -A -v 192.168.10.145 6666 

(server identifies incoming connection) 

connection on 192.168.10.145.6666 from 192.168.10.144.32843 

(client identifies connection establishment, and is interrupted) 

connected on 192.168.10.144.32843 to 192.168.10.145.6666 

^C 

Linux% 

(server identifies connection has terminated and exits) 

connection closed by peer 
C:\> 


(client is restarted, specifying same port number as before) 

Linux% sock -A -b 32843 -v 192.168.10.144 6666 

connect() error: Connection refused 


At this point we would expect local port 32843 to be unavailable, but because 
of the way -A works on Linux, we are allowed to make use of it. This is a violation 
of the original TCP specification, but it is allowed by [RFC1122] and [RFC6191], as 
mentioned before. These specifications allow a new connection request to arrive 
and be accepted for a connection that is in the TIME_WAIT state, if there is a 
strong reason to believe that segments on the new connection will not be confused 
with segments on the previous instantiation of the connection based on a 
combination of the sequence numbers and timestamps. [RFC1337] and the appendix of 
[RFC1323] show some of the pitfalls related to this rule. 

13.5.3 Quiet Time Concept 

The 2MSL wait provides protection against delayed segments from an earlier 
instantiation of a connection being interpreted as part of a new connection that 
uses the same local and foreign IP addresses and port numbers. But this works 
only if a host with connections in the 2MSL wait does not crash. 

What if a host with connections in the TIME_WAIT state crashes, reboots 
within the MSL, and immediately establishes new connections using the same 
local and foreign IP addresses and port numbers corresponding to the local con¬ 
nections that were in the TIME_WAIT state before the crash? In this scenario, 
delayed segments from the connections that existed before the crash can be mis¬ 
interpreted as belonging to the new connections created after the reboot. This can 
happen regardless of how the initial sequence number is chosen after the reboot. 

To protect against this scenario, [RFC0793] states that TCP should wait an 
amount of time equal to the MSL before creating any new connections after a 
reboot or crash. This is called the quiet time. Few implementations abide by this 
because most hosts take longer than the MSL to reboot after a crash. Also, if appli¬ 
cations use their own checksums or encryption, errors such as these are easily 
detected. 

13.5.4 FIN_WAIT_2 State 

In the FIN_WAIT_2 state, TCP has sent a FIN and the other end has acknowledged 
it. Unless a half-close is being performed, the TCP must wait for the application 
on the other end to recognize that it has received an end-of-file notification and 
close its end of the connection, which causes a FIN to be sent. Only when the 
application performs this close (and its FIN is received) does the active closing 
TCP move from the FIN_WAIT_2 to the TIME_WAIT state. This means that one 
end of the connection can remain in this state forever. The other end is still in the 
CLOSE_WAIT state and can remain there forever, until the application decides to 
issue its close. 

Many implementations prevent this infinite wait in the FIN_WAIT_2 state as 
follows: If the application that does the active close does a complete close, not a 
half-close indicating that it expects to receive data, a timer is set. If the connection 
is idle when the timer expires, TCP moves the connection into the CLOSED state. 
In Linux, the variable net.ipv4.tcp_fin_timeout can be adjusted to control 
the number of seconds to which the timer is set. Its default value is 60s. 

13.5.5 Simultaneous Open and Close Transitions 

We have seen the normal uses for the SYN_SENT and SYN_RCVD states that 
correspond to sending and receiving SYN segments, respectively. As illustrated 
in Figure 13-3, TCP was purposely designed to handle simultaneous opens that 
result in a single connection. When a simultaneous open occurs, the state 
transitions differ from those shown in Figure 13-9. Both ends send a SYN at about 
the same time, entering the SYN_SENT state. When each end receives its peer's 
SYN segment, the state changes to SYN_RCVD, and each end resends a SYN and 
acknowledges the received SYN. When each end receives the SYN plus the ACK, 
the state changes to ESTABLISHED. 

For a simultaneous close, in terms of Figure 13-6, both ends go from 
ESTABLISHED to FIN_WAIT_1 when the application issues the close. This causes both 
FINs to be sent, and they probably pass each other somewhere in the network. 
When its peer's FIN arrives, each end transitions from FIN_WAIT_1 to the 
CLOSING state, and each endpoint sends its final ACK. Upon receiving a final ACK, 
each endpoint's state changes to TIME_WAIT, and the 2MSL wait is initiated. 


13.6 Reset Segments 

We mentioned the RST bit field in the TCP header in Chapter 12. A segment hav¬ 
ing this bit set to "on" is called a "reset segment" or simply a "reset." In general, a 
reset is sent by TCP whenever a segment arrives that does not appear to be correct 
for the referenced connection. (We use the term referenced connection to mean the 
connection specified by the 4-tuple in the TCP and IP headers of the reset.) Resets 
ordinarily result in a fast teardown of a TCP connection. We can construct 
scenarios to demonstrate the use of reset segments. 

13.6.1 Connection Request to Nonexistent Port 

A common case for generating a reset segment is when a connection request 
arrives and no process is listening on the destination port. We saw this previously 
when we encountered the "connection refused" error messages. These are com¬ 
mon with TCP. In the case of UDP, we saw in Chapter 10 that an ICMP Destination 
Unreachable (Port Unreachable) message is generated when a datagram arrives 
for a destination port that is not in use. TCP uses a reset segment instead. 

An example of this is trivial to generate—we use the Telnet client and specify 
a port number that is not in use on the destination. This destination can just as 
well be the local computer: 


Linux% telnet localhost 9999 

Trying 127.0.0.1... 

telnet: connect to address 127.0.0.1: Connection refused 

This error message is output by the Telnet client immediately. Listing 13-5 
shows the packet exchange corresponding to this command. 


Listing 13-5 Reset generated by attempt to open connection to nonexistent port 

1 22:15:16.348064 127.0.0.1.32803 > 127.0.0.1.9999: 

S [tcp sum ok] 3357881819:3357881819(0) win 32767 
<mss 16396,sackOK,timestamp 16945235 0,nop,wscale 0> 
(DF) [tos 0x10] (ttl 64, id 42376, len 60) 

2 22:15:16.348105 127.0.0.1.9999 > 127.0.0.1.32803: 

R [tcp sum ok] 0:0(0) ack 3357881820 win 0 
(DF) [tos 0x10] (ttl 64, id 0, len 40) 


The values we need to examine in Listing 13-5 are the Sequence Number field 
and ACK Number field in the reset (second) segment. Because the ACK bit field 
was not on in the arriving SYN segment, the sequence number of the reset is set 
to 0 and the ACK number is set to the incoming ISN plus the number of data bytes 
in the segment. Although there is no data in the arriving segment, the SYN bit 
logically occupies 1 byte of sequence number space; therefore, in this example the 
ACK number in the reset segment is set to the ISN, plus the data length (0), plus 
1 for the SYN bit. 
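
The rule for filling in the reset's fields can be written out as a short calculation. The function and parameter names below are ours. The branch for an arriving segment that does have the ACK bit set is not covered by the listing; it follows the rule in [RFC0793] that such a reset takes its sequence number from the arriving segment's ACK Number field.

```python
SEQ_SPACE = 2 ** 32   # sequence numbers wrap modulo 2^32


def reset_fields(seq, data_len, syn=False, fin=False,
                 ack_set=False, ack_num=0):
    """Return the Sequence Number and ACK Number for a reset sent in
    response to an arriving segment (names are illustrative only)."""
    if ack_set:
        # Arriving segment carried an ACK: the reset's sequence number is
        # taken from that ACK Number field, and no ACK number is needed.
        return {"seq": ack_num % SEQ_SPACE, "ack": None}
    # Otherwise the sequence number is 0 and the ACK number covers
    # everything the segment occupied in sequence space. SYN and FIN each
    # consume one sequence number in addition to the data bytes.
    seg_len = data_len + (1 if syn else 0) + (1 if fin else 0)
    return {"seq": 0, "ack": (seq + seg_len) % SEQ_SPACE}


if __name__ == "__main__":
    # The SYN from Listing 13-5: ISN 3357881819, no data
    print(reset_fields(3357881819, 0, syn=True))
    # {'seq': 0, 'ack': 3357881820}
```

The result matches the second segment of Listing 13-5, which shows sequence number 0 and ACK number 3357881820.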

For a reset segment to be accepted by a TCP, the ACK bit field must be set and 
the ACK Number field must be within the valid window (see Chapter 12). This 
helps to prevent a simple attack in which anyone able to generate a reset matching 
the appropriate connection (4-tuple) could disrupt a connection [RFC5961]. 

13.6.2 Aborting a Connection 

We saw in Figure 13-1 that the normal way to terminate a connection is for one 
side to send a FIN. This is sometimes called an orderly release because the FIN is 
sent after all previously queued data has been sent, and there is normally no loss 
of data. But it is also possible to abort a connection by sending a reset instead of a 
FIN at any time. This is sometimes called an abortive release. 

Aborting a connection provides two features to the application: (1) any queued 
data is thrown away and a reset segment is sent immediately, and (2) the receiver 
of the reset can tell that the other end did an abort instead of a normal close. 
The API being used by the application must provide a way to generate the abort 
instead of a normal close. 

The sockets API provides this capability by using the "linger on close" socket 
option (SO_LINGER) with a 0 linger value. Essentially this means "Linger for 
no time in making sure data gets to the other side, then abort." In the following 
example, we show what happens when a remote command that generates a large 
amount of output is canceled by the user: 


Linux% ssh linux cat /usr/share/dict/words 

Aarhus 

Aaron 

Ababa 

aback 

abaft 

abandon 

abandoned 

abandoning 

abandonment 

abandons 

... continues ... 

^C 

Killed by signal 2. 

Here the user has decided to abort the output of this command. The words 
file has 45,427 words in it, so this command was probably some sort of mistake. 
When the user types the interrupt character, the system indicates that the process 
(here, the ssh program) has been killed by signal number 2. This signal is called 
SIGINT and usually terminates a program when it is delivered. Listing 13-6 shows 
the tcpdump output for this example. (We have deleted many of the intermediate 
packets, because they add nothing to the discussion.) 


Listing 13-6 Aborting a connection with a reset (RST) instead of a FIN 

Linux# tcpdump -vvv -s 1500 tcp 

1 22:33:06.386747 192.168.10.140.2788 > 192.168.10.144.ssh: 
      S [tcp sum ok] 1520364313:1520364313(0) win 65535 
      <mss 1460,nop,nop,sackOK> 
      (DF) (ttl 128, id 43922, len 48) 

2 22:33:06.386855 192.168.10.144.ssh > 192.168.10.140.2788: 
      S [tcp sum ok] 181637276:181637276(0) ack 1520364314 
      win 5840 
      <mss 1460,nop,nop,sackOK> 
      (DF) (ttl 64, id 0, len 48) 

3 22:33:06.387676 192.168.10.140.2788 > 192.168.10.144.ssh: 
      . [tcp sum ok] 1:1(0) ack 1 win 65535 
      (DF) (ttl 128, id 43923, len 40) 

(... ssh encrypted authentication exchange and bulk data transfer ...) 

4 22:33:13.648247 192.168.10.140.2788 > 192.168.10.144.ssh: 
      R [tcp sum ok] 1343:1343(0) ack 132929 win 0 
      (DF) (ttl 128, id 44004, len 40) 


Segments 1-3 show the normal connection establishment. When the interrupt 
character is hit, the connection is aborted. The reset segment contains a sequence 
number and acknowledgment number. Also notice that the reset segment elicits 
no response from the other end; it is not acknowledged at all. The receiver of the 
reset aborts the connection and advises the application that the connection was 
reset. This often results in the error indication "Connection reset by peer" or a 
similar message. 
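
The abortive release described in this section can be reproduced with a few lines of Python over the loopback interface. This is a sketch under stated assumptions: the function names are ours, the `struct linger` layout of two C ints is the one used on Linux and other UNIX-like systems (Windows uses a different layout), and the reset-on-close behavior shown is what Linux does with a 0 linger value.

```python
import socket
import struct


def abortive_close(sock):
    """Close sock with SO_LINGER enabled and a linger time of 0,
    which causes a reset to be sent instead of a FIN."""
    # struct linger { int l_onoff; int l_linger; } on UNIX-like systems
    linger = struct.pack("ii", 1, 0)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
    sock.close()


def abort_demo():
    """Open a loopback connection, abort it from the client side, and
    report what the server side observes."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # port 0: let the OS pick a port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(5.0)              # avoid blocking forever
    abortive_close(client)
    try:
        data = server_side.recv(1024)
        result = "orderly release (FIN)" if data == b"" else "data received"
    except ConnectionResetError:
        result = "connection reset by peer"
    finally:
        server_side.close()
        listener.close()
    return result


if __name__ == "__main__":
    print(abort_demo())
```

Had the client called close() without setting SO_LINGER, the server's read would instead return zero bytes, the end-of-file indication that corresponds to an orderly release.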

13.6.3 Half-Open Connections 

A TCP connection is said to be half-open if one end has closed or aborted the 
connection without the knowledge of the other end. This can happen anytime one of 
the peers crashes. As long as there is no attempt to transfer data across a half-open 
connection, the end that is still up does not detect that the other end has crashed. 

Anofher common cause of a half-open connecfion is when one hosf is pow¬ 
ered off insfead of shuf down properly. This happens, for example, when PCs are 
being used fo run remofe login clienfs and are swifched off af fhe end of fhe day. 
If fhere was no dafa fransfer going on when fhe power was cuf, fhe server will 
never know fhaf fhe clienf disappeared (if would sfill fhink fhe connecfion is in 
fhe ESTABLISHED sfafe). When fhe user comes in fhe nexf morning, powers on 
fhe PC, and sfarfs a new session, a new occurrence of fhe server is sfarfed on fhe 
server hosf. This can lead fo many half-open TCP connecfions on fhe server hosf. 
(In Chapfer 17 we will see a way for one end of a TCP connecfion fo discover fhaf 
fhe ofher end has disappeared using TCP's keepalive opfion.) 
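
As a brief preview of the keepalive option mentioned above, the sketch below shows how an application might request keepalive probes on a socket through the Berkeley sockets API. The helper name enable_keepalive is ours, and the probe timing is left at the system defaults, which vary between operating systems.

```python
import socket


def enable_keepalive(sock):
    """Ask TCP to probe an idle connection; return True if the option is set."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("keepalive enabled:", enable_keepalive(s))
    s.close()
```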

We can easily create a half-open connection. In this case, we do so on the 
client rather than the server. We will execute the Telnet client on 10.0.0.1, 
connecting to the Sun RPC Service (sunrpc, port 111) server at 10.0.0.7 (see Listing 
13-7). We type one line of input and watch it go across with tcpdump, and then 
we disconnect the Ethernet cable on the server's host and reboot the server host. 
This simulates the server host crashing. (We disconnect the Ethernet cable before 
rebooting the server to prevent it from sending a FIN out of the open connections, 
which some TCPs do when they are shut down.) After the server has rebooted, we 
reconnect the cable and try to send another line from the client to the server. After 
rebooting, the server's TCP has lost all memory of the connections that existed 
before, so it knows nothing about the connection that the data segment references. 
The rule of TCP is that the receiver responds with a reset. 


Listing 13-7 The server host is disconnected and rebooted, leaving a half-open connection at the 
client. When it receives additional data on the connection it now knows nothing about, 
the server responds with a reset segment, closing the connection at both ends. 

Linux% telnet 10.0.0.7 sunrpc 

Trying 10.0.0.7... 

Connected to 10.0.0.7. 

Escape character is 
foo 

(Ethernet cable disconnected and server rebooted) 
bar 

Connection closed by remote host 


Listing 13-8 shows the tcpdump output for this example. 


Listing 13-8 Reset in response to data segment on a half-open connection 

1 23:15:48.804142 IP (tos 0x10, ttl 64, id 20095, offset 0, 

flags [DF], proto 6, length: 60) 

10.0.0.1.1310 > 10.0.0.7.sunrpc: 

S [tcp sum ok] 2365970104:2365970104(0) win 5840 
<mss 1460,sackOK,timestamp 3849492679 0,nop,wscale 2> 

2 23:15:48.804742 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], 

proto 6, length: 60) 

10.0.0.7.sunrpc > 10.0.0.1.1310: 

S [tcp sum ok] 2093796387:2093796387(0) ack 2365970105 win 5792 
<mss 1460,sackOK,timestamp 654784 3849492679,nop,wscale 0> 

3 23:15:48.805028 IP (tos 0x10, ttl 64, id 20097, offset 0, 

flags [DF], proto 6, length: 52) 

10.0.0.1.1310 > 10.0.0.7.sunrpc: 

. [tcp sum ok] 1:1(0) ack 1 win 1460 
<nop,nop,timestamp 3849492680 654784> 

4 23:15:51.999394 IP (tos 0x10, ttl 64, id 20099, offset 0, 

flags [DF], proto 6, length: 57) 

10.0.0.1.1310 > 10.0.0.7.sunrpc: 

P [tcp sum ok] 1:6(5) ack 1 win 1460 
<nop,nop,timestamp 3849495875 654784> 

5 23:15:51.999874 IP (tos 0x0, ttl 64, id 12773, offset 0, 

flags [DF], proto 6, length: 52) 

10.0.0.7.sunrpc > 10.0.0.1.1310: 

. [tcp sum ok] 1:1(0) ack 6 win 5792 
<nop,nop,timestamp 656421 3849495875> 

6 23:17:19.419611 arp who-has 10.0.0.7 (Broadcast) tell 0.0.0.0 

7 23:17:20.419142 arp who-has 10.0.0.7 (Broadcast) tell 0.0.0.0 

8 23:17:21.427458 arp reply 10.0.0.7 is-at 00:e0:00:88:ad:d6 

9 23:17:21.921745 arp who-has 10.0.0.1 tell 10.0.0.7 

10 23:17:21.921892 arp reply 10.0.0.1 is-at 00:04:5a:9f:9e:80 

11 23:17:23.437114 arp who-has 10.0.0.7 (Broadcast) tell 10.0.0.7 

12 23:17:34.804196 arp who-has 10.0.0.7 tell 10.0.0.1 

13 23:17:34.804650 arp reply 10.0.0.7 is-at 00:e0:00:88:ad:d6 

14 23:17:43.684786 IP (tos 0x10, ttl 64, id 20101, offset 0, 

flags [DF], proto 6, length: 57) 

10.0.0.1.1310 > 10.0.0.7.sunrpc: 

P [tcp sum ok] 6:11(5) ack 1 win 1460 
<nop,nop,timestamp 3849607577 656421> 

15 23:17:43.685277 IP (tos 0x10, ttl 64, id 0, offset 0, 

flags [DF], proto 6, length: 40) 

10.0.0.7.sunrpc > 10.0.0.1.1310: 

R [tcp sum ok] 2093796388:2093796388(0) win 0 


Segments 1-3 are the normal connection establishment. Segment 4 sends the 
line "foo" to the sunrpc server (the 5 bytes required include a carriage return and 
newline character), and segment 5 is the acknowledgment. 

At this point we disconnect the Ethernet cable from the server (address 
10.0.0.7), reboot, and reconnect the cable. This takes about 90s. We then type the 
next line of input to the client ("bar"), and when we type the return key the line is 
sent to the server (the first TCP segment after the ARP traffic in Listing 13-8). This 
elicits a reset response from the server, which no longer has any knowledge of the 
existence of the connection. 

Note that when the host reboots, it uses gratuitous ARP (see Chapter 4) in 
order to determine if its IPv4 address is already in use on the segment, and to 
supply it to others. It also requests the MAC address for IPv4 address 10.0.0.1 
because that is its default router to the Internet. 

13.6.4 TIME-WAIT Assassination (TWA) 

As mentioned previously, the TIME_WAIT state is intended to allow any 
datagrams lingering from a closed connection to be discarded. During this period, the 
waiting TCP usually has little to do; it merely holds the state until the 2MSL timer 
expires. If, however, it receives certain segments from the connection during this 
period, or more specifically an RST segment, it can become desynchronized. This 
is called TIME-WAIT Assassination (TWA) [RFC1337]. Consider the exchange of 
packets shown in Figure 13-10. 

[Figure 13-10: time-sequence diagram between the active opener (client) and the 
passive opener (server). After data transfer, the FIN exchange completes; the client 
enters TIME_WAIT and the server is CLOSED. A late-arriving duplicate from the 
server causes the client to send ACK, Seq = K, ACK = L. The server answers with 
RST + ACK, Seq = L, ACK = K, and the client moves to CLOSED prematurely.] 

Figure 13-10 An RST segment can "assassinate" the TIME_WAIT state and force the connection to 
close prematurely. Various methods exist to resist this problem, including ignoring 
RST segments when in the TIME_WAIT state. 


In the example shown in Figure 13-10, the server has completed its role in the 
connection and cleared any state. The client remains in the TIME_WAIT state. 
When the FIN exchange completes, the client's next sequence number is K and 
the server's is L. The late-arriving segment is sent from the server to the client 
using sequence number L - 100 and containing ACK number K - 200. When the 
client receives this segment, it determines that both the sequence number and ACK 
values are "old." When receiving such old segments, TCP responds by sending an 
ACK with the most current sequence number and ACK values (K and L, 
respectively). However, when the server receives this segment, it has no information 
whatsoever about the connection and therefore replies with an RST segment. This 
is no problem for the server, but it causes the client to prematurely transition from 
TIME_WAIT to CLOSED. Most systems avoid this problem by simply not reacting 
to reset segments while in the TIME_WAIT state. 
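
The exchange just described can be modeled in a few lines. This is a toy sketch of the client's behavior in TIME_WAIT only, not a real TCP implementation: the function name, the tuple layout of the reply, and the ignore_rst switch are our own, and sequence numbers are compared as plain integers with no wraparound handling.

```python
def on_segment_in_time_wait(flags, seq, snd_nxt, rcv_nxt, ignore_rst=True):
    """Process one segment arriving at an endpoint that is in TIME_WAIT.

    Returns (new_state, reply), where reply is None or a tuple
    (flags, seq, ack) describing the segment sent in response.
    """
    if "RST" in flags:
        # Acting on the reset is the "assassination"; ignoring it is the fix.
        return ("TIME_WAIT", None) if ignore_rst else ("CLOSED", None)
    if seq < rcv_nxt:
        # Old duplicate: answer with an ACK carrying the current values.
        return ("TIME_WAIT", ("ACK", snd_nxt, rcv_nxt))
    return ("TIME_WAIT", None)


if __name__ == "__main__":
    K, L = 1000, 5000   # arbitrary example values for the final sequence numbers
    # Late-arriving segment from the server with sequence number L - 100.
    print(on_segment_in_time_wait({"ACK"}, L - 100, snd_nxt=K, rcv_nxt=L))
    # The server's RST, with and without the protection enabled.
    print(on_segment_in_time_wait({"RST", "ACK"}, L, K, L, ignore_rst=False))
    print(on_segment_in_time_wait({"RST", "ACK"}, L, K, L, ignore_rst=True))
```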


13.7 TCP Server Operation 

We said in Chapter 1 that most TCP servers are concurrent. When a new 
connection request arrives at a server, the server accepts the connection and invokes 
a new process or thread to handle the new client. Depending on the operating 
system, various other resources may be allocated to invoke the new server. We are 
interested in the interaction of TCP with concurrent servers. In particular, we wish 
to become familiar with how TCP servers use port numbers and how multiple 
concurrent clients are handled. 

13.7.1 TCP Port Numbers 

We can see how TCP handles port numbers by watching any TCP server. We shall 
watch the secure shell server (called sshd) using the netstat command on a 
dual-stack IPv4/IPv6-capable host. The sshd application implements the Secure 
Shell Protocol [RFC4254], which provides an encrypted and authenticated remote 
terminal capability. The following output is on a system with no active secure 
shell connections. (We have deleted all of the output lines except the one 
associated with the server.) 


Linux% netstat -a -n -t 

Active Internet connections (servers and established) 

Proto Recv-Q Send-Q Local Address Foreign Address State 
tcp 0 0 :::22 :::* LISTEN 

The -a option reports on all network endpoints, including those in either 
listening or non-listening state. The -n flag prints IP addresses as dotted-decimal (or 
hex) numbers, instead of trying to use the DNS to convert the address to a name, 
and prints numeric port numbers (e.g., 22) instead of service names (e.g., ssh). 
The -t option selects only TCP endpoints. 

The local address (which really means local endpoint) is output as :::22, 
which is the IPv6-oriented way of referring to the all-zeros address, also called 
the wildcard address, along with port number 22. This means that an incoming 
connection request (i.e., a SYN) to port 22 will be accepted on any local interface. 
If the host were multihomed (this one is), we could specify a single IP address for 
the local IP address (one of the host's IP addresses), and only connections received 
on that interface would be accepted. (We will see an example of this later in this 
section.) Port 22 is the well-known port number reserved for the Secure Shell 
Protocol. Other port numbers are maintained by the IANA [ITP]. 

The foreign address is output as :::*, which means both a wildcard address 
and port number (i.e., it represents a wildcard endpoint). Here, the foreign IP 
address and foreign port number are not known yet, because the local endpoint is 
in the LISTEN state, waiting for a connection to arrive. We now start a secure shell 
client on the host 10.0.0.3 that connects to this server. Here are the relevant lines 
from the netstat output (the Recv-Q and Send-Q columns, which contain only 
values of zero, have been removed for clarity): 

Linux% netstat -a -n -t 

Active Internet connections (servers and established) 

Proto Local Address          Foreign Address          State 
tcp   :::22                  :::*                     LISTEN 
tcp   ::ffff:10.0.0.1:22     ::ffff:10.0.0.3:16137    ESTABLISHED 

The second line for port 22 is the ESTABLISHED connection. All four elements 
of the local and foreign endpoints are filled in for this connection: the local IP 
address and port number, and the foreign IP address and port number. The local IP 
address corresponds to the interface on which the connection request arrived (the 
Ethernet interface, identified by its IPv4-mapped IPv6 address, ::ffff:10.0.0.1). 

The local endpoint in the LISTEN state is left alone. This is the endpoint that 
the concurrent server uses to accept future connection requests. It is the TCP 
module in the operating system that creates the new endpoint in the ESTABLISHED 
state, when the incoming connection request arrives and is accepted. Also notice 
that the port number for the ESTABLISHED connection does not change: it is 22, 
the same as the LISTEN endpoint. 
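
The IPv4-mapped addresses in the netstat output can also be observed from an application. The sketch below is an illustration under assumptions, not part of the original experiment: it binds an IPv6 socket to the wildcard address with the IPV6_V6ONLY option cleared, connects to it over IPv4 on the loopback interface, and returns the peer address the server sees. It returns None on hosts where dual-stack sockets are not available; the function name is ours.

```python
import socket


def mapped_peer_address():
    """Connect over IPv4 to an IPv6 wildcard listener and return the peer
    address the server sees, or None if dual-stack sockets are unavailable."""
    try:
        listener = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    except OSError:
        return None
    client = None
    try:
        # Clear IPV6_V6ONLY so the IPv6 socket also accepts IPv4 connections.
        listener.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
        listener.bind(("::", 0))          # wildcard address, OS-chosen port
        listener.listen(1)
        listener.settimeout(5)
        port = listener.getsockname()[1]
        client = socket.create_connection(("127.0.0.1", port), timeout=5)
        conn, peer = listener.accept()
        conn.close()
        return peer[0]                    # IPv4 client shown in mapped form
    except OSError:
        return None
    finally:
        if client is not None:
            client.close()
        listener.close()


if __name__ == "__main__":
    print(mapped_peer_address())
```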

We now initiate another client request from the same system (10.0.0.3) to this 
server. Here is the relevant netstat output: 

Linux% netstat -a -n -t 

Active Internet connections (servers and established) 

Proto Local Address          Foreign Address          State 
tcp   :::22                  :::*                     LISTEN 
tcp   ::ffff:10.0.0.1:22     ::ffff:10.0.0.3:16140    ESTABLISHED 
tcp   ::ffff:10.0.0.1:22     ::ffff:10.0.0.3:16137    ESTABLISHED 


We now have two ESTABLISHED connections from the same host to the same 
server. Both have a local port number on the server of 22. This is not a problem 
for TCP because the foreign port numbers are different. They must be different 
because each of the secure shell clients uses an ephemeral port, and the definition 
of an ephemeral port is one that is not currently in use on that host (10.0.0.3). 

This example reiterates, yet again, that TCP demultiplexes incoming segments 
using all four values that constitute the local and foreign endpoints: destination 
IP address, destination port number, source IP address, and source port number. 
TCP cannot determine which process gets an incoming segment by looking at the 
destination port number only. Also, the only one of the three endpoints at port 
22 that will receive incoming connection requests is the one in the LISTEN state. 
The endpoints in the ESTABLISHED state cannot receive SYN segments, and the 
endpoint in the LISTEN state cannot receive data segments. The host operating 
system ensures this. (If it did not, TCP could become quite confused and not work 
properly.) 
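
The demultiplexing rule can be sketched with a small lookup table. This is a simplified model of what the paragraph above describes, not kernel code: the table layout, the "*" wildcard marker, and the function name are our own, and a SYN is matched only against listening endpoints, exactly as stated in the text.

```python
def demultiplex(endpoints, dst_ip, dst_port, src_ip, src_port, is_syn):
    """Pick the endpoint that receives an incoming segment.

    endpoints maps (local_ip, local_port, foreign_ip, foreign_port) to a
    label; "*" stands for a wildcard field. Returns the label or None.
    """
    if not is_syn:
        # Data and ACK segments need a fully specified (ESTABLISHED) endpoint.
        return endpoints.get((dst_ip, dst_port, src_ip, src_port))
    # Connection requests go to a listener: specific local address first,
    # then the wildcard local address.
    for key in ((dst_ip, dst_port, "*", "*"), ("*", dst_port, "*", "*")):
        if key in endpoints:
            return endpoints[key]
    return None


if __name__ == "__main__":
    table = {
        ("*", 22, "*", "*"): "listener",
        ("10.0.0.1", 22, "10.0.0.3", 16137): "conn-16137",
        ("10.0.0.1", 22, "10.0.0.3", 16140): "conn-16140",
    }
    print(demultiplex(table, "10.0.0.1", 22, "10.0.0.3", 16140, is_syn=False))
    print(demultiplex(table, "10.0.0.1", 22, "10.0.0.3", 16200, is_syn=True))
```

Both ESTABLISHED endpoints share local port 22, so the destination port alone could never tell them apart; only the foreign port differs.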

Next we initiate a third client connection, from the IP address 169.229.62.97 
that is across the DSL PPPoE link from the server 10.0.0.1, and not on the same 
Ethernet. (The output below has the Proto column removed, which contains only 
tcp, for clarity.) 

Linux% netstat -a -n -t 

Active Internet connections (servers and established) 

Send-Q Local Address              Foreign Address              State 
0      :::22                      :::*                         LISTEN 
0      ::ffff:10.0.0.1:22         ::ffff:10.0.0.3:16140        ESTABLISHED 
0      ::ffff:10.0.0.1:22         ::ffff:10.0.0.3:16137        ESTABLISHED 
928    ::ffff:67.125.227.195:22   ::ffff:169.229.62.97:1473    ESTABLISHED 


The local IP address of the third ESTABLISHED connection now corresponds to 
the interface address of the PPPoE link on the multihomed host (67.125.227.195). 
Note that the Send-Q status is not 0 but is instead 928 bytes. This means that the 
server host has sent 928 bytes on the connection for which it has not yet heard an 
acknowledgment. 


13.7.2 Restricting Local IP Addresses 

We can see what happens when the server does not wildcard the local IP address 
but instead sets it to one particular local address. If we run our sock program as a 
server and provide it with a particular IP address, that address becomes the local 
address of the listening endpoint. For example: 

Linux% sock -s 10.0.0.1 8888 

This restricts this server to using connections that arrive only on the local IPv4 
address 10.0.0.1. The netstat output reflects this: 

Linux% netstat -a -n -t 

Active Internet connections (servers and established) 

Proto Recv-Q Send-Q Local Address Foreign Address State 

tcp 0 0 10.0.0.1:8888 0.0.0.0:* LISTEN 

One thing that is especially interesting about this example is that our sock 
program is binding only to the local IPv4 address 10.0.0.1, so our netstat 
output looks significantly different. In our previous example, the wildcard address 
and port number indications were across both versions of IP. In this case, 
however, we are bound to a particular address, port, and address family (IPv4 only). 
If we now connect to this server from the local network, from the host 10.0.0.3, 
it works fine: 


Linux% netstat -a -n -t 

Active Internet connections (servers and established) 

Proto Recv-Q Send-Q Local Address Foreign Address State 

tcp 0 0 10.0.0.1:8888 0.0.0.0:* LISTEN 

tcp 0 0 10.0.0.1:8888 10.0.0.3:16153 ESTABLISHED 

If we instead try to connect to this server from a host using a destination 
address other than 10.0.0.1 (even including the local address 127.0.0.1), the 
connection request is not accepted by the TCP module. If we watch with 
tcpdump, the SYN elicits an RST segment, as we show in Listing 13-9. 


Listing 13-9 Rejection of a connection request based on local IP address of server 

1 22:29:19.905593 IP 127.0.0.1.1292 > 127.0.0.1.8888: 

S 591843787:591843787(0) win 32767 

<mss 16396,sackOK,timestamp 3587463952 0,nop,wscale 2> 

2 22:29:19.906095 IP 127.0.0.1.8888 > 127.0.0.1.1292: 

R 0:0(0) ack 591843788 win 0 


The server application never sees the connection request; the rejection is 
done by the operating system's TCP module, based on the local address specified 
by the application and the destination address contained in the arriving SYN 
segment. We see that the capability of restricting local IP addresses is quite strict. 
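
The same rejection can be observed without the sock program. The following sketch assumes a Linux host, where every address in 127.0.0.0/8 is a usable loopback address; it binds a listener to 127.0.0.2 and then tries to reach the same port through two different destination addresses. The function name connect_result is ours.

```python
import socket


def connect_result(bind_ip, connect_ip):
    """Listen on bind_ip only, then try to connect to connect_ip.

    Returns "connected" or "refused".
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((bind_ip, 0))           # specific local address, any port
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5)
    try:
        client.connect((connect_ip, port))
        return "connected"
    except ConnectionRefusedError:        # the SYN was answered with an RST
        return "refused"
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(connect_result("127.0.0.2", "127.0.0.2"))
    print(connect_result("127.0.0.2", "127.0.0.1"))
```

In the refused case the listening application is never told that a connection attempt was made.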

13.7.3 Restricting Foreign Endpoints 

In Chapter 10, we saw that a UDP server can normally specify the foreign IP address 
and foreign port number, in addition to specifying the local IP address and local 
port number. The abstract interface functions for TCP given in [RFC0793] allow a 
server doing a passive open to have either a fully specified foreign endpoint (to 
wait for a particular client to issue an active open) or an unspecified foreign 
endpoint (to wait for any client). 

Unfortunately, the ordinary Berkeley sockets API does not provide a way to 
do this. The server must leave the client's endpoint unspecified, wait for the 
connection to arrive, and then examine the IP address and port number of the client. 
Table 13-3 summarizes the three types of address bindings that a TCP server can 
establish. 


Table 13-3 Address and port number binding options available to a TCP server 


Local Address    Foreign Address        Restricted to        Comment 

local_IP.lport   foraddr.foreign_port   One client           Not usually supported 

local_IP.lport   *.*                    One local endpoint   Unusual (used by DNS 
                                                             servers) 

*.local_port     *.*                    One local port       Most common; multiple 
                                                             address families 
                                                             (IPv4/IPv6) may be 
                                                             supported 


In all cases, local_port is the server's assigned port and local_IP must 
be a unicast IP address used by the local system. The ordering of the three rows 
in the table is the order that the TCP module applies when trying to determine 
which local endpoint receives an incoming connection request. The most specific 
binding (the first row, if supported) is tried first, and the least specific (the last row 
with both IP addresses wildcarded) is tried last. For systems supporting IPv4 and 
IPv6 ("dual-stack"), the port space may be combined. In essence, this means that 
a server that binds to a port using IPv6 addressing is also bound to the 
same port in IPv4. 

13.7.4 Incoming Connection Queue 

A concurrent server invokes a new process or thread to handle each client, so 
the listening server should always be ready to handle the next incoming 
connection request. That is the underlying reason for using concurrent servers. But there 
is still a chance that multiple connection requests will arrive while the listening 
server is creating a new process, or while the operating system is busy running 
other higher-priority processes, or worse yet, that the server is being attacked with 
bogus connection requests that are never allowed to be established. How does 
TCP handle these scenarios? 

To fully explore this question, we must first understand that new connections 
may be in one of two distinct states before they are made available to an 
application. The first case is connections that have not yet completed but for which a 
SYN has been received (these are in the SYN_RCVD state). The second case is 
connections that have already completed the three-way handshake and are in the 
ESTABLISHED state but have not yet been accepted by the application. Internally, 
the operating system ordinarily has two distinct connection queues, one for each 
of these cases. 

An application has limited control over the sizing of these queues. 
Traditionally, using the Berkeley sockets API, an application had only indirect control of the 
sum of the sizes of these two queues. In modern Linux kernels this behavior has 
been changed to be the number of connections in the second case (ESTABLISHED 
connections). The application can therefore limit the number of fully formed 
connections waiting for it to handle. In Linux, then, the following rules apply: 

1. When a connection request arrives (i.e., the SYN segment), the system-wide 
   parameter net.ipv4.tcp_max_syn_backlog is checked (default 1000). 
   If the number of connections in the SYN_RCVD state would exceed this 
   threshold, the incoming connection is rejected. 

2. Each listening endpoint has a fixed-length queue of connections that have 
   been completely accepted by TCP (i.e., the three-way handshake is 
   complete) but not yet accepted by the application. The application specifies 
   a limit to this queue, commonly called the backlog. This backlog must be 
   between 0 and a system-specific maximum called net.core.somaxconn, 
   inclusive (default 128). 

   Keep in mind that this backlog value specifies only the maximum number 
   of queued connections for one listening endpoint, all of which have already 
   been accepted by TCP and are waiting to be accepted by the application. 
   This backlog has no effect whatsoever on the maximum number of 
   established connections allowed by the system, or on the number of clients that 
   a concurrent server can handle concurrently. 

3. If there is room on this listening endpoint's queue for this new connection, 
   the TCP module ACKs the SYN and completes the connection. The server 
   application with the listening endpoint does not see this new connection 
   until the third segment of the three-way handshake is received. Also, the 
   client may think the server is ready to receive data when the client's active 
   open completes successfully, before the server application has been 
   notified of the new connection. If this happens, the server's TCP just queues the 
   incoming data. 

4. If there is not enough room on the queue for the new connection, the TCP 
   delays responding to the SYN, to give the application a chance to catch 
   up. Linux is somewhat unique in this behavior: it persists in not ignoring 
   incoming connections if it possibly can. If the 
   net.ipv4.tcp_abort_on_overflow system control variable is set, new 
   incoming connections are reset with a reset segment. 
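
Rule 3 can be checked with a short loopback experiment. The sketch below is our own illustration, with a hypothetical function name: the listener sets a backlog of 1 and deliberately delays its call to accept(), yet the client's connect() and sendall() both succeed because the server's TCP completes the handshake and queues the data on its own.

```python
import socket


def data_queued_before_accept(payload=b"hello"):
    """Send data on a connection the server application has not accepted yet,
    then accept it and return what the server reads."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)                    # backlog of 1
    listener.settimeout(5)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5)
    try:
        # The active open completes although accept() has not been called.
        client.connect(listener.getsockname())
        client.sendall(payload)           # queued by the server's TCP

        conn, _ = listener.accept()       # only now does the application see it
        conn.settimeout(5)
        try:
            return conn.recv(len(payload))
        finally:
            conn.close()
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(data_queued_before_accept())
```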

Sending reset segments on overflow is not generally advisable and is not 
turned on by default. The client has attempted to contact the server, and if it 
receives a reset during the SYN exchange, it may falsely conclude that no server is 
present (instead of concluding that there is a server present but it is busy). Being 
too busy is really a form of "soft" or temporary error rather than a hard error. 
Normally, when the queue is full, the application or the operating system is busy, 
preventing the application from servicing incoming connections. This condition 
could change in a short while. But if the server's TCP responded with a reset, 
the client's active open would abort (which is what we saw happen if the server 
was not started). Without the reset, if the listening server does not get around to 
accepting some of the already-accepted connections that have filled its queue to 
the limit, the client's active open eventually times out, according to normal TCP 
mechanisms. In the case of Linux, the connecting clients are just slowed for a 
significant period of time; they will neither time out nor be reset. 

We can see what happens when the incoming connection queue becomes full 
using our sock program. We invoke it with a new option (-O) that tells it to pause 
after creating the listening endpoint, before accepting any connection requests. 
If we then invoke multiple clients during this pause period, the server's queue of 
accepted connections should fill, and we can see what happens with tcpdump. 


Linux% sock -s -v -q1 -O30000 6666 


The -q1 option sets the backlog of the listening endpoint to 1. The -O30000 
option causes the program to sleep for 30,000s (basically a long time, about 8 
hours) before accepting any client connections. If we now try to connect to this 
server continually, the first four connections are completed immediately. After 
that, two connections are completed every 9s. Other operating systems vary 
considerably in how this is handled. In Solaris 8 and FreeBSD 4.7, for example, two 
connections are handled immediately and the third times out; subsequent 
connections time out as well. 

Listing 13-10 shows the tcpdump output of a Linux client connecting to a 
FreeBSD server running the sock program with the arguments just given. (We 
have marked the client port numbers in bold when the TCP connection is 
established, that is, when the three-way handshake completes.) 


Listing 13-10 The FreeBSD server accepts two connections immediately. Subsequent connections 
receive no response and eventually time out at the client. 

1 21:28:47.399872 IP (tos 0x0, ttl 64, id 46646, offset 0, 

flags [DF], proto 6, length: 60) 

63.203.76.212.2461 > 169.229.62.97.6666: 

S [tcp sum ok] 2998137201:2998137201(0) win 5808 
<mss 1452,sackOK,timestamp 4102309703 0,nop,wscale 2> 

2 21:28:47.413770 IP (tos 0x0, ttl 47, id 6876, offset 0, 

flags [DF], proto 6, length: 60) 

169.229.62.97.6666 > 63.203.76.212.2461: 

S [tcp sum ok] 5583769:5583769(0) ack 2998137202 win 1460 
<mss 1412,nop,wscale 0,nop,nop,timestamp 219082980 4102309703> 

3 21:28:47.414058 IP (tos 0x0, ttl 64, id 46648, offset 0, 

flags [DF], proto 6, length: 52) 

63.203.76.212.2461 > 169.229.62.97.6666: 

. [tcp sum ok] 1:1(0) ack 1 win 1452 
<nop,nop,timestamp 4102309717 219082980> 

4 21:28:47.423673 IP (tos 0x0, ttl 64, id 19651, offset 0, 

flags [DF], proto 6, length: 60) 

63.203.76.212.2462 > 169.229.62.97.6666: 

S [tcp sum ok] 2996964252:2996964252(0) win 5808 
<mss 1452,sackOK,timestamp 4102309727 0,nop,wscale 2> 

5 21:28:47.436897 IP (tos 0x0, ttl 47, id 26581, offset 0, 

flags [DF], proto 6, length: 60) 

169.229.62.97.6666 > 63.203.76.212.2462: 

S [tcp sum ok] 3761536245:3761536245(0) ack 2996964253 win 1460 
<mss 1412,nop,wscale 0,nop,nop,timestamp 219082983 4102309727> 

6 21:28:47.437186 IP (tos 0x0, ttl 64, id 19653, offset 0, 

flags [DF], proto 6, length: 52) 

63.203.76.212.2462 > 169.229.62.97.6666: 

. [tcp sum ok] 1:1(0) ack 1 win 1452 
<nop,nop,timestamp 4102309741 219082983> 

7 21:28:47.446198 IP (tos 0x0, ttl 64, id 24292, offset 0, 

flags [DF], proto 6, length: 60) 

63.203.76.212.2463 > 169.229.62.97.6666: 

S [tcp sum ok] 2991331729:2991331729(0) win 5808 
<mss 1452,sackOK,timestamp 4102309749 0,nop,wscale 2> 

8 21:28:50.445771 IP (tos 0x0, ttl 64, id 24294, offset 0, 

flags [DF], proto 6, length: 60) 

63.203.76.212.2463 > 169.229.62.97.6666: 

S [tcp sum ok] 2991331729:2991331729(0) win 5808 
<mss 1452,sackOK,timestamp 4102312750 0,nop,wscale 2> 

9 21:28:56.444900 IP (tos 0x0, ttl 64, id 24296, offset 0, 

flags [DF], proto 6, length: 60) 

63.203.76.212.2463 > 169.229.62.97.6666: 

S [tcp sum ok] 2991331729:2991331729(0) win 5808 
<mss 1452,sackOK,timestamp 4102318750 0,nop,wscale 2> 

10 21:29:08.443031 IP (tos 0x0, ttl 64, id 24298, offset 0, 

flags [DF], proto 6, length: 60) 

63.203.76.212.2463 > 169.229.62.97.6666: 

S [tcp sum ok] 2991331729:2991331729(0) win 5808 
<mss 1452,sackOK,timestamp 4102330750 0,nop,wscale 2> 

11 21:29:32.439406 IP (tos 0x0, ttl 64, id 24300, offset 0, 

flags [DF], proto 6, length: 60) 

63.203.76.212.2463 > 169.229.62.97.6666: 

S [tcp sum ok] 2991331729:2991331729(0) win 5808 
<mss 1452,sackOK,timestamp 4102354750 0,nop,wscale 2> 

12 21:30:20.432118 IP (tos 0x0, ttl 64, id 24302, offset 0, 

flags [DF], proto 6, length: 60) 

63.203.76.212.2463 > 169.229.62.97.6666: 

S [tcp sum ok] 2991331729:2991331729(0) win 5808 
<mss 1452,sackOK,timestamp 4102402750 0,nop,wscale 2> 


The first client's connection request from port 2461 is accepted by TCP 
(segments 1-3). The second client's connection request from port 2462 is also accepted 
by TCP (segments 4-6). The server application is still asleep and has not accepted 
either connection yet. Everything has been done by the TCP module in the 
operating system. Also, the two clients have returned successfully from their active 
opens, because the three-way handshakes are complete. 

We try to start a third client, whose SYN appears as segment 7 (port 2463), but the 
server-side TCP ignores the SYNs because the queue for this listening endpoint 
is full. The client retransmits its SYN in segments 8-12 using binary exponential 
backoff. In FreeBSD and Solaris, TCP ignores the incoming SYN when the queue 
is full. 
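
The retransmission times in Listing 13-10 follow directly from the doubling rule. The sketch below reproduces them; the initial timeout of 3s is read off the gap between segments 7 and 8 in the listing rather than taken from any configuration, and the function name is ours.

```python
def syn_retransmit_offsets(initial_timeout=3.0, retries=5):
    """Return the time of each SYN retransmission, in seconds after the
    first SYN, when the timeout doubles after every attempt."""
    offsets = []
    elapsed = 0.0
    timeout = initial_timeout
    for _ in range(retries):
        elapsed += timeout
        offsets.append(elapsed)
        timeout *= 2                      # binary exponential backoff
    return offsets


if __name__ == "__main__":
    # Segments 8-12 appear roughly 3, 9, 21, 45, and 93 s after segment 7.
    print(syn_retransmit_offsets())
```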

Recall that TCP accepts an incoming connection request (i.e., a SYN) if there is 
room on the listener's queue, without giving the application a chance to see where 
it is from (the source IP address and source port number). This is not required 
by TCP; it is just the common implementation technique (i.e., the way Berkeley 
sockets work). If an alternative to the Berkeley sockets API were used (e.g., TLI/ 
XTI), the application could be provided a way to learn when a connection request 
arrives and then allow the application to choose whether to accept the connection 
or not. Even though TLI provided this capability in theory, it never fully caught 
on, so we are effectively left with the TCP interface provided by Berkeley sockets. 

So with Berkeley sockets, be aware that with TCP, when the application is told 
that a connection has just arrived, TCP's three-way handshake is already over. This 
behavior also means that a TCP server has no way to cause a client's active open to 
fail. When a new client connection is passed to the server application, TCP's 
three-way handshake is over, and the client's active open has completed successfully. If 
the server then looks at the client's IP address and port number and decides it does 
not want to service this client, all the server can do is either close the connection 
(causing a FIN to be sent) or reset the connection (causing an RST to be sent). In 
either case the client thought everything was OK when its active open completed 
and may have already sent a request to the server. Other transport-layer 
protocols may be implemented that provide this separation to the application between 
arrival and acceptance (i.e., the OSI transport layer), but not TCP. 
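
In practice, a server that wants to restrict its clients therefore follows an accept-then-check pattern. The sketch below shows one way this might look for an IPv4 listener; the function name and the allowed_ips parameter are hypothetical, and the unwanted connection is closed normally (a FIN), although a reset would serve equally well.

```python
import socket


def accept_if_allowed(listener, allowed_ips):
    """Accept the next connection on an IPv4 listener; keep it only if the
    client's address is in allowed_ips, otherwise close it and return None."""
    conn, (peer_ip, _peer_port) = listener.accept()
    if peer_ip in allowed_ips:
        return conn
    # Too late to make the client's active open fail: the handshake is done.
    conn.close()
    return None


if __name__ == "__main__":
    lst = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    lst.bind(("127.0.0.1", 0))
    lst.listen(5)
    lst.settimeout(5)
    cli = socket.create_connection(lst.getsockname(), timeout=5)
    print(accept_if_allowed(lst, {"127.0.0.1"}) is not None)
    cli.close()
    lst.close()
```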


13.8 Attacks Involving TCP Connection Management 

A SYN flood is a TCP DoS attack whereby one or more malicious clients generate 
a series of TCP connection attempts (SYN segments) and send them at a server, 
often with a "spoofed" (e.g., random) source IP address. The server allocates some 
amount of connection resources to each partial connection. Because the 
connections are never established, the server may start to deny service to future 
legitimate requests because its memory is exhausted holding state for many half-open 
connections. 

This attack is somewhat difficult to repel, because it is not always easy to 
distinguish between legitimate connection attempts and SYN floods. One 
mechanism invented to deal with this issue is called SYN cookies [RFC4987]. The main 
insight with SYN cookies is that most of the information that would be stored for a 
connection when a SYN arrives could be encoded inside the Sequence Number field 
supplied with the SYN + ACK. The target machine using SYN cookies need not 
allocate any storage for the incoming connection request; it allocates real memory 
only once the SYN + ACK segment has itself been acknowledged (and the initial 
sequence number is returned). In that case, all the vital connection parameters can 
be recovered and the connection can be placed in the ESTABLISHED state. 

Producing SYN cookies involves a careful selection process of the TCP ISN 
at servers. Essentially, the server must encode any essential state in the Sequence 
Number field in its SYN + ACK that is returned in the ACK Number field from a 
legitimate client. There are several ways of doing this, but we will mention the 
technique adopted by Linux.

A server receiving an incoming SYN causes its ISN (supplied to the client in
the SYN + ACK segment) to be set to a value constructed in the following way: the
top 5 bits are (t mod 32) where t is a 32-bit counter that increases by 1 every 64s,
the next 3 bits are an encoding of the server's MSS (one of eight possibilities), and
the remaining 24 bits are a server-selected cryptographic hash of the connection
4-tuple and t value. (See Chapter 18 for a detailed explanation of cryptographic
hashes.)
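
The bit layout just described can be sketched in a few lines of Python. This is
only an illustration of the 5 + 3 + 24 bit packing, not the Linux code: the MSS
table values, the secret key, the use of HMAC-SHA256 as the hash, and the names
make_cookie and check_cookie are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical table of eight MSS values; the 3-bit field holds an index into it.
MSS_TABLE = [216, 536, 1024, 1220, 1360, 1440, 1460, 8960]


def mss_index(mss):
    """Index of the largest table entry not exceeding the client's MSS."""
    idx = 0
    for i, value in enumerate(MSS_TABLE):
        if value <= mss:
            idx = i
    return idx


def cookie_hash(secret, four_tuple, t):
    """24-bit keyed hash of the connection 4-tuple and the counter t."""
    msg = "|".join(str(field) for field in (*four_tuple, t)).encode()
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:3], "big")


def make_cookie(secret, four_tuple, mss, now):
    """Build a 32-bit ISN: 5 bits of (t mod 32), 3 bits of MSS index, 24 hash bits."""
    t = int(now) // 64  # counter that increases by 1 every 64s
    return ((t % 32) << 27) | (mss_index(mss) << 24) | cookie_hash(secret, four_tuple, t)


def check_cookie(secret, four_tuple, cookie, now):
    """Return the encoded MSS if the cookie is valid, else None.

    `cookie` is the ISN recovered from the client's ACK (ACK number minus 1).
    Only the current and the previous counter values are accepted here.
    """
    t_now = int(now) // 64
    for t in (t_now, t_now - 1):
        if t % 32 == cookie >> 27 and cookie_hash(secret, four_tuple, t) == cookie & 0xFFFFFF:
            return MSS_TABLE[(cookie >> 24) & 0x7]
    return None
```

Because the MSS index and the counter travel inside the sequence number, the
server needs no per-connection memory until check_cookie succeeds on the
returning ACK.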

When SYN cookies are used, the server always responds with a SYN + ACK (as
with any typical TCP connection establishment), and the server is able to rebuild
its queue of arriving SYNs when it receives a legitimate ACK where the value for
t still produces the same output from the cryptographic hash. There are at least
two pitfalls of this approach. First, the scheme prohibits the use of arbitrary-size
segments because of the encoding of the MSS. Second, and much less serious,
connection establishment cycles that are very long (longer than 64s) do not work
properly because the counter would wrap. For these reasons, this function is not
enabled by default.

Another type of degradation attack on TCP involves PMTUD. In this case, an
attacker fabricates an ICMP PTB message containing a very small MTU value (e.g.,
68 bytes). This forces the victim TCP to attempt to fit its data into very small
packets, greatly reducing its performance. This problem can be addressed in several
ways. The most brute-force way would be to simply disable PMTUD for the host.
Another option would be to disable PMTUD in cases where an ICMP PTB message
with next-hop MTU under 576 bytes is received. A third option, implemented
by Linux and mentioned briefly earlier, is to insist that the minimum packet size
(for large packets used by TCP) always be fixed at some value, and larger packets
simply not have the IPv4 DF bit field turned on. This approach is similar to,
although perhaps somewhat more attractive than, completely disabling PMTUD.

Another type of attack involves disrupting an existing TCP connection and
possibly taking it over (called hijacking). These forms of attacks usually involve a
first step of "desynchronizing" the two TCP endpoints so that if they were to talk
to each other, they would be using invalid sequence numbers. They are particular
examples of sequence number attacks [RFC1948]. They can be accomplished in
at least two ways: by causing invalid state transitions during connection
establishment (similar to TWA; see Section 13.6.4), and by generating extra data while
in the ESTABLISHED state. Once the two ends can no longer communicate (but
believe they have an open connection), an attacker can introduce traffic into the
connection, which is considered (by TCP at least) as valid.

A collection of attacks generally called spoofing attacks involve TCP segments
that have been specially tailored by an attacker to disrupt or alter the behavior of
an existing TCP connection. A variety of these attacks and their mitigation
techniques are discussed in [RFC4953]. An attacker can generate a spoofed reset
segment and send it to an existing TCP endpoint. Provided the connection 4-tuple
and checksum are correct, and the sequence number is in range, the reset generally
results in a connection abort at either endpoint. This is of growing concern
because as networks become faster, a wider range of sequence numbers are
considered "in window" to maintain performance (see Chapter 15). Other types of
segments (SYNs, even ACKs) can also be spoofed (and combined with flooding
attacks), causing myriad problems. Mitigation techniques include authenticating
each segment (e.g., using the TCP-AO option), requiring reset segments to have
one particular sequence number instead of one from a range, requiring particular
values in the Timestamps option, and using other forms of cookies in which
otherwise noncritical data values are arranged to depend on more exact knowledge of
the connection or a secret value.

There are spoofing attacks that are not part of the TCP protocol yet can affect
TCP's operation. For example, ICMP can be used to modify PMTUD behavior.
It can also be used to indicate that a port or host is not available, and this often
causes a TCP connection to be terminated. Many of these attacks are described in
[RFC5927], which also suggests a number of ways of improving robustness against
spoofed ICMP messages. The suggestions amount to validating not only the ICMP
message but also as much of the contained TCP segment as possible. For example,
the contained segment should have an appropriate 4-tuple and sequence number.


13.9 Summary 

Before two processes can exchange data using TCP, they must establish a
connection between themselves. When they are done, they terminate the connection. This
chapter has provided a detailed look at how connections are established using a
three-way handshake, and how they are terminated using four segments. We also
saw how TCP can handle simultaneous open and close operations and how various
options, including the Selective ACK, Timestamps, MSS, TCP-AO, and UTO
options, are handled.

We used tcpdump and Wireshark to show TCP's behavior and its use of the
fields in the TCP header. We also saw how connection establishment can time out,
how resets are sent and interpreted, what happens with a half-open connection,
and how TCP provides a half-close. TCP bounds both the number of connection
attempts it will try when performing an active open and also the number of
connection attempts it will service after performing a passive open.

Fundamental to understanding the operation of TCP is its state transition
diagram. We followed through the steps involved in connection establishment
and termination, and the state transitions that take place. We also looked at the
implications of TCP's connection establishment for the design of concurrent TCP
servers.

A TCP connection is uniquely defined by a 4-tuple: the local IP address, local
port number, foreign IP address, and foreign port number. Whenever a connection
is terminated, one end must maintain knowledge of the connection, and we
saw that the TIME_WAIT state handles this. The rule is that the end that does
the active close enters this state for twice the implementation's MSL, which helps
protect TCP from processing segments from an older instantiation of the same
connection. Using the Timestamps option can shorten the waiting time when new
connections attempt to use the same 4-tuple, and it has other benefits for detecting
wrapped sequence numbers and performing better RTT measurements.

TCP can be vulnerable to both resource exhaustion and spoofing attacks, but 
a number of methods have been developed to resist such issues. In addition, TCP 
can be affected by other protocols such as ICMP. Additional protection for ICMP 
is possible by carefully processing the original datagram returned by ICMP
messages. Finally, TCP can be used in combination with protocols that provide
security at other layers (e.g., IPsec and TLS/SSL, described in Chapter 18), which is
now standard practice.


TCP Timeout and Retransmission


14.1 Introduction 

Efficiency and performance are issues that we have not discussed much so far. 
We have primarily been concerned with correctness of operation. In this chapter 
and the next two, we will be focusing not only on the basic tasks TCP performs, 
but also on how well it performs them. The TCP protocol provides a reliable data 
delivery service between two applications using an underlying network layer 
(IP) that may lose, duplicate, or reorder packets. In order to provide an error-free 
exchange of data, TCP resends data it believes has been lost. To decide what data 
it needs to resend, TCP depends on a continuous flow of acknowledgments from 
receiver to sender. When data segments or acknowledgments are lost, TCP
initiates a retransmission of the data that has not been acknowledged. TCP has two
separate mechanisms for accomplishing retransmission, one based on time and
one based on the structure of the acknowledgments. The second approach is
usually much more efficient than the first.

TCP sets a timer when it sends data, and if the data is not acknowledged when
the timer expires, a timeout or timer-based retransmission of data occurs. The
timeout occurs after an interval called the retransmission timeout (RTO). TCP has another
way of initiating a retransmission called fast retransmission or fast retransmit, which
usually happens without any delay. Fast retransmit is based on inferring losses by
noticing when TCP's cumulative acknowledgment fails to advance in the ACKs
received over time, or when ACKs carrying selective acknowledgment information
(SACKs) indicate that out-of-order segments are present at the receiver.
Generally speaking, when the sender believes that the receiver might be missing some
data, a choice needs to be made between sending new (unsent) data and
retransmitting. In this chapter we look closely at how TCP determines that a segment
is lost and what to send in response. The issue of how much to send is deferred
until Chapter 16, where we discuss TCP's congestion control procedures that are
commonly invoked when packet loss is suspected. Here, we investigate how the
RTO is set based on measurements of a connection's round-trip time (RTT), the
mechanics of a timer-based retransmission, and how TCP's fast retransmission
mechanism works. We also look at how SACKs are used to help a TCP sender
determine what data is missing at the receiver, the effect of reordering and
duplication of IP packets on TCP's behavior, and the way TCP can change its packet size
when retransmitting. We also look briefly at some attacks that can be mounted to
fool TCP into behaving more aggressively or more passively.


14.2 Simple Timeout and Retransmission Example 

We have already seen some examples of timeout and retransmission. (1) In the
ICMP Destination Unreachable (Port Unreachable) example in Chapter 8 we saw
the TFTP client using UDP employing a simple (and poor) timeout and
retransmission strategy: it assumed 5s was an adequate timeout period and
retransmitted every 5s. (2) In the attempted connection to a nonexistent host in Chapter 13,
we saw that when TCP tried to establish the connection it retransmitted its SYN
segment using a longer and longer delay between each successive retransmission.
(3) In Chapter 3, we saw what happens when Ethernet encounters a collision. All
of these mechanisms are initiated by the expiration of a timer.

We shall first look at the timer-based retransmission strategy used by TCP. We
will establish a connection, send some data to verify that everything is OK, isolate
one end of the connection, send some more data, and watch what TCP does. In this
case, we will use Wireshark to see how the connection progresses (see Figure 14-1).

Segments 1, 2, and 3 correspond to the normal TCP connection establishment
handshake. When the Web server completes the connection establishment,
it remains silent, awaiting a Web request. Before we provide the request, we isolate
(disconnect) the server host. The input at the client side is as follows:

Linux% telnet 10.0.0.10 80
Trying 10.0.0.10...
Connected to 10.0.0.10.
Escape character is '^]'.
GET / HTTP/1.0

Connection closed by foreign host.


This request cannot be delivered to the server, so it remains in TCP's queue at
the client for quite some time. During this period, the netstat command on the
client indicates that the queue is not empty:


Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0     18 10.0.0.9:1043           10.0.0.10:www           ESTABLISHED


[Figure 14-1: Wireshark capture of the connection. The selected frame 5 (82 bytes,
from 10.0.0.9 port 1043 to 10.0.0.10 port 80, Seq 1, Ack 1, 16 bytes of data, flags
PSH and ACK) is marked as a suspected retransmission with an RTO of 0.206642
seconds.]

Figure 14-1 A simple example of TCP's timeout and retransmission mechanism. The first
retransmit occurs at time 42.954, followed by other retransmissions at times 43.374,
44.215, 45.895, and 49.255. The intervals between successive retransmissions are 206ms,
420ms, 841ms, 1.68s, and 3.36s, respectively. These times represent a doubling of the
timeout between successive retransmissions of the same segment.

Here we see that 18 bytes are in the send queue, waiting to be delivered to
the Web server. The 18 bytes consist of the characters displayed in the preceding
request, plus two sets of carriage-return and newline characters. Details of the
rest of the output, including addresses and state information, are described in the
following paragraphs.

Segment 4 is the client's first attempt to send the Web request, at 42.748s. The
next try is at 42.954, 0.206s later. Then it launches another try at 43.374, which
is 0.420s later. Additional retries (retransmissions) occur at 44.215, 45.895, and
49.255s. These represent time differences of 0.841, 1.680, and 3.360s, respectively.
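
The arithmetic behind these intervals is easy to check. The short Python sketch
below recomputes the gaps between the transmission times quoted above and the
ratio of each gap to the previous one; the variable names are ours, and Decimal
is used only to avoid floating-point rounding noise in the subtraction.

```python
from decimal import Decimal

# Transmission times (in seconds) of the same segment, as quoted in the text.
send_times = [Decimal(s) for s in
              ("42.748", "42.954", "43.374", "44.215", "45.895", "49.255")]

# Gap between each transmission and the one before it.
intervals = [later - earlier for earlier, later in zip(send_times, send_times[1:])]

# Ratio of each gap to the preceding gap; a value near 2 indicates doubling.
ratios = [float(b / a) for a, b in zip(intervals, intervals[1:])]

print("intervals:", [str(i) for i in intervals])
print("ratios:   ", [round(r, 2) for r in ratios])
```

Every ratio comes out close to 2, which is the doubling pattern visible in the trace.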

This doubling of time between successive retransmissions is called a binary
exponential backoff, and we saw it in Chapter 13 during a failed TCP connection
establishment attempt. We shall explore it in more detail later. If we measure the
elapsed time between the initial request and the time at which the connection is
finally aborted, the total time is about 15.5 minutes. After that, the following error
message is displayed at the client:


Connection closed by foreign host. 


Logically, TCP has two thresholds to determine how persistently it will attempt
to resend the same segment. These thresholds are described in the Host
Requirements RFC [RFC1122], and we mentioned them briefly in Chapter 13. Threshold R1
indicates the number of tries TCP will make (or the amount of time it will wait) to
resend a segment before passing "negative advice" to the IP layer (e.g., causing it to
reevaluate the IP route it is using). Threshold R2 (larger than R1) dictates the point
at which TCP should abandon the connection. These thresholds are suggested to
be at least three retransmissions and 100s, respectively. For connection
establishment (sending SYN segments), these values may be different from those for data
segments, and the R2 value for SYN segments is required to be at least 3 minutes.

In Linux, the R1 and R2 values for regular data segments are available to be
changed by applications or can be changed using the system-wide configuration
variables net.ipv4.tcp_retries1 and net.ipv4.tcp_retries2,
respectively. These are measured in the number of retransmissions, and not in units of
time. The default value for tcp_retries2 is 15, which corresponds roughly to
13-30 minutes, depending on the connection's RTO. The default value for
net.ipv4.tcp_retries1 is 3. For SYN segments, net.ipv4.tcp_syn_retries
and net.ipv4.tcp_synack_retries bound the number of retransmissions
of SYN segments; their default value is 5 (roughly 180s). Windows also has a
number of variables that affect the overall behavior of TCP, including values for R1 and
R2. These are all available by modifying values under the following registry keys
[WINREG]:

HKLM\System\CurrentControlSet\Services\Tcpip\Parameters
HKLM\System\CurrentControlSet\Services\Tcpip6\Parameters

Of immediate interest is the value called TcpMaxDataRetransmissions.
This corresponds to the value of tcp_retries2 in Linux. It has a default value of
5.

Even in the simple retransmission example we have seen so far, TCP is required
to assign some timeout value to its retransmission timer to dictate how long it
should await an ACK for data it sends. If TCP were only ever used in one static
environment, it would be possible to determine one particular correct value for the
timeout value. Because TCP needs to operate in a large variety of environments,
which themselves may change over time, TCP needs to determine this timeout
value based on the current situation. For example, if a network link failed and
traffic were rerouted, the RTT would change (possibly in a major way). In other words,
TCP needs to dynamically determine its RTO. We consider this problem next.
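
As a rough check on the numbers in this section, the following Python sketch
adds up the waits for a segment that is retransmitted tcp_retries2 times with
the timeout doubling after each attempt. It rests on two assumptions that do not
come from the trace itself: that the backed-off timeout is capped at 120s (the
Linux maximum RTO), and that the connection is dropped once the final wait
expires. The function name time_until_abort is ours.

```python
def time_until_abort(initial_rto, retries, rto_max=120.0):
    """Total seconds waited across the first transmission and `retries` retransmissions.

    The timeout doubles after every expiry (binary exponential backoff) and is
    capped at rto_max, an assumed upper bound of 120s.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(retries + 1):   # one wait per transmission of the segment
        total += min(rto, rto_max)
        rto *= 2
    return total


# Initial RTO of 0.206s as seen in Figure 14-1, with the Linux default of 15 retries.
seconds = time_until_abort(0.206, 15)
print(f"{seconds:.1f}s, about {seconds / 60:.1f} minutes")
```

Under these assumptions the total is about 931 seconds, or roughly 15.5 minutes,
which agrees with the elapsed time measured in the example.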


14.3 Setting the Retransmission Timeout (RTO) 

Fundamental to TCP's timeout and retransmission procedures is how to set the 
RTO based upon measurement of the RTT experienced on a given connection. 
If TCP retransmits a segment earlier than the RTT, it may be injecting duplicate 
traffic into the network unnecessarily. Conversely, if it delays sending until much 
longer than one RTT, the overall network utilization (and single-connection 
throughput) drops when traffic is lost. Knowing the RTT is made more complicated
because it can change over time, as routes and network usage vary. TCP
must track these changes and modify its timeout accordingly in order to maintain 
good performance. 

Because TCP sends acknowledgments when it receives data, it is possible to 
send a byte with a particular sequence number and measure the time required to 
receive an acknowledgment that covers that sequence number. Each such measurement
is called an RTT sample. The challenge for TCP is to establish a good
estimate for the range of RTT values given a set of samples that vary over time. 
The second step is how to set the RTO based on these values. Getting this "right" 
is very important for TCP's performance. 

The RTT is estimated for each TCP connection separately, and one
retransmission timer is pending whenever any data is in flight that consumes a sequence
number (including SYN and FIN segments). The proper way to set this timer has
been a subject of research for years, and improvements are made on an occasional
basis. In this section, we will explore some of the more important milestones in the
evolution of the method used to compute the RTO. We begin with the first
("classic") method, as detailed in [RFC0793].

14.3.1 The Classic Method 

The original TCP specification [RFC0793] had TCP update a smoothed RTT
estimator (called SRTT) using the following formula:

$$SRTT \leftarrow \alpha(SRTT) + (1 - \alpha)\,RTT$$

Here, SRTT is updated based on both its existing value and a new sample,
RTT. The constant $\alpha$ is a smoothing or scale factor with a recommended value
between 0.8 and 0.9. SRTT is updated every time a new measurement is made.
With the original recommended value for $\alpha$, it is clear that 80% to 90% of each new
estimate is from the previous estimate and 10% to 20% is from the new
measurement. This type of average is also known as an exponentially weighted moving
average (EWMA) or low-pass filter. It is convenient for implementation reasons because
it requires only one previous value of SRTT to be stored in order to keep the
running estimate.

Given the estimator SRTT, which changes as the RTT changes, [RFC0793]
recommended that the RTO be set to the following:

$$RTO = \min(ubound, \max(lbound, (SRTT)\beta))$$

where $\beta$ is a delay variance factor with a recommended value of 1.3 to 2.0, ubound
is an upper bound (suggested to be, e.g., 1 minute), and lbound is a lower bound
(suggested to be, e.g., 1s) on the RTO. We shall call this assignment procedure the
classic method. It generally results in the RTO being set either to 1s, or to about twice
SRTT. For relatively stable distributions of the RTT, this was adequate. However,
when TCP was run over networks with highly variable RTTs (e.g., early packet
radio networks in this case), it did not perform so well.
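
The classic method amounts to two one-line functions. The Python sketch below
is an illustration only: the function names are ours, times are in seconds, and
the defaults (alpha of 0.875, beta of 2.0, bounds of 1s and 60s) are simply
values picked from the recommended ranges given above.

```python
def classic_srtt(srtt, rtt_sample, alpha=0.875):
    """Smoothed RTT update: keep `alpha` of the old estimate, take the rest from the sample."""
    return alpha * srtt + (1 - alpha) * rtt_sample


def classic_rto(srtt, beta=2.0, lbound=1.0, ubound=60.0):
    """RTO as a fixed multiple of SRTT, clamped between lbound and ubound."""
    return min(ubound, max(lbound, srtt * beta))


# A 100ms path: the lower bound dominates, so the RTO stays at 1s.
srtt = classic_srtt(0.100, 0.120)
print(srtt, classic_rto(srtt))
```

Because the multiplier beta is fixed, the RTO cannot widen when samples become
more scattered around the same average, which is the weakness the text goes on
to describe.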

14.3.2 The Standard Method 

In [J88], Jacobson detailed problems with the classic method further—basically, 
that the timer specified by [RFC0793] cannot keep up with wide fluctuations in the 
RTT (and in particular, it causes unnecessary retransmissions when the real RTT is 
much larger than expected). Unnecessary retransmissions add to the network load, 
when the network is already loaded, as indicated by the increasing sample RTT. 

To address this problem, the method used to assign the RTO was enhanced 
to accommodate a larger variability in the RTT. This is accomplished by keeping 
track of an estimate of the variability in the RTT measurements in addition to the 
estimate of its average. Setting the RTO based on both a mean and a variability 
estimator provides a better timeout response to wide fluctuations in the round-trip
times than just calculating the RTO as a constant multiple of the mean.

Figures 5 and 6 in [J88] show a comparison of the [RFC0793] RTO values for 
some actual round-trip times, versus the RTO calculations we show next, which 
take into account the variability of the round-trip times. If we think of the RTT 
measurements made by TCP as samples of a statistical process, estimating both 
the mean and variance (or standard deviation) helps to make better predictions 
about the possible future values the process may take on. A good prediction for 
the range of possible values for the RTT helps TCP determine an RTO that is
neither too large nor too small in most cases.

As described by Jacobson, the mean deviation is a good approximation to the
standard deviation, but it is easier and faster to compute. Calculating the standard
deviation requires executing a square root mathematical operation on the
variance, which was considered to be too expensive for a fast TCP implementation.
(This is not the whole story, really. See the fascinating history of "the debate" in
[G04].) We therefore need running estimates of both the average and the
mean deviation. This leads to the following equations that are applied to each RTT
measurement M (called RTT earlier):

$$\mathit{srtt} \leftarrow (1 - g)(\mathit{srtt}) + (g)M$$
$$\mathit{rttvar} \leftarrow (1 - h)(\mathit{rttvar}) + (h)(|M - \mathit{srtt}|)$$

$$RTO = \mathit{srtt} + 4(\mathit{rttvar})$$

Here, the value srtt effectively replaces the earlier value of SRTT, and the value
rttvar, which becomes an EWMA of the mean deviation, is used instead of $\beta$ to
help determine the RTO. This set of equations can also be written in a form that
requires a smaller number of operations when implemented on a conventional
computer:

$$\mathit{Err} = M - \mathit{srtt}$$
$$\mathit{srtt} \leftarrow \mathit{srtt} + g(\mathit{Err})$$
$$\mathit{rttvar} \leftarrow \mathit{rttvar} + h(|\mathit{Err}| - \mathit{rttvar})$$

$$RTO = \mathit{srtt} + 4(\mathit{rttvar})$$

As suggested, srtt is the EWMA for the mean and rttvar is the EWMA for
the absolute error, |Err|. Err is the difference between the measured value M and
the current RTT estimator srtt. Both srtt and rttvar are used to calculate the RTO,
which varies over time. The gain g is the weight given to a new RTT sample M
in the average srtt and is set to 1/8. The gain h is the weight given to a new mean
deviation sample (absolute difference of the new sample M from the running
average srtt) for the deviation estimate rttvar and is set to 1/4. The larger gain for the
deviation makes the RTO go up faster when the RTT changes. The values for g and
h are chosen as (negative) powers of 2, allowing the overall set of computations to
be implemented in a computer using fixed-point integer arithmetic with shift and
add operations instead of multiplies and divides.
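
To see that the two ways of writing the update agree, the following Python
sketch implements both and compares them using exact fractions. The function
names are ours. One assumption is made explicit: in the weighted-average form,
the deviation sample |M - srtt| is taken against the value of srtt before the new
sample is folded in, which is what the Err form computes.

```python
from fractions import Fraction

G = Fraction(1, 8)   # gain for the mean
H = Fraction(1, 4)   # gain for the mean deviation


def update_weighted(srtt, rttvar, m, g=G, h=H):
    """Weighted-average form of the update."""
    new_srtt = (1 - g) * srtt + g * m
    # Assumption: the deviation uses srtt as it was before this sample.
    new_rttvar = (1 - h) * rttvar + h * abs(m - srtt)
    return new_srtt, new_rttvar


def update_error(srtt, rttvar, m, g=G, h=H):
    """Error-term form: one subtraction, then two scaled additions."""
    err = m - srtt
    return srtt + g * err, rttvar + h * (abs(err) - rttvar)


def rto(srtt, rttvar):
    return srtt + 4 * rttvar


# Feed the same samples (in ms) through both forms and compare at each step.
state_a = state_b = (Fraction(100), Fraction(50))
for sample in (180, 95, 110, 400, 105):
    state_a = update_weighted(*state_a, Fraction(sample))
    state_b = update_error(*state_b, Fraction(sample))
    assert state_a == state_b
print(float(rto(*state_b)))
```

With gains of 1/8 and 1/4, the multiplications by g and h reduce to right shifts
by 3 and 2 bits in an integer implementation.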


Note 

[J88] specified 2 * rttvar in the calculation of RTO, but after further research, [J90]
changed the value to 4 * rttvar, which is what appeared in the BSD Net/1
implementation and ultimately in the standard [RFC6298].


Comparing the classic method with Jacobson's, we see that the calculations of
the RTT average are similar ($\alpha$ is 1 minus the gain g) but a different gain is used.
Also, Jacobson's calculation of the RTO depends on both the smoothed RTT and
the smoothed deviation, whereas the classic method used a simple multiple of the
smoothed RTT. This is the basis for the way many TCP implementations compute
their RTOs to this day, and because of its adoption as the basis for [RFC6298] we
shall call it the standard method, although there are slight refinements in [RFC6298],
which we shall now discuss.

14.3.2.1 Clock Granularity and RTO Bounds 

TCP has a continuously running "clock" that is used when taking RTT
measurements. As with initial sequence numbers, real TCP connections do not start their
clocks at zero and the clock does not have infinite precision. Rather, the TCP clock
is usually the value of a variable that is updated as the system clock advances, not
necessarily one-for-one. The length of the TCP's clock "tick" is called its
granularity. Traditionally, this value was relatively large (about 500ms), but more recent
implementations use finer-granularity clocks (e.g., 1ms for Linux).

The granularity can affect the details of making RTT measurements and also
how the RTO is set. In [RFC6298], the granularity is used to refine how updates to
the RTO are made. In addition, a lower bound is placed on the RTO. The equation
used is as follows:

$$RTO = \max(\mathit{srtt} + \max(G, 4(\mathit{rttvar})), 1000)$$

where G is the timer granularity and 1000ms represents a lower bound on the total
RTO (recommended by rule (2.4) of [RFC6298]). Consequently the RTO is always at
least 1s. An optional upper bound is also allowed, provided it has a value of at least 60s.

14.3.2.2 Initial Values 

We have seen how the estimators are updated as time progresses, but we also need
to know how to set their initial values. Before the first SYN exchange, TCP has no
good idea what value to use for setting the initial RTO. It also does not know what
to use as the initial values for its estimators, unless the system has provided hints
at this information (some systems cache this information in the forwarding table;
see Section 14.9). According to [RFC6298], the initial setting for the RTO should be
1s, although 3s is used in the event of a timeout on the initial SYN segment. When
the first RTT measurement M is received, the estimators are initialized as follows:

$$\mathit{srtt} \leftarrow M$$
$$\mathit{rttvar} \leftarrow M/2$$
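
Putting the pieces of the standard method together (initialization on the first
sample, the Err-based update, the granularity term, and the 1s lower bound)
gives a small estimator. The Python class below is a minimal sketch rather than
any particular stack's code: the class and parameter names are ours, all times
are in milliseconds, and the optional upper bound is omitted.

```python
class RttEstimator:
    """Standard-method RTT/RTO estimator; all times in milliseconds."""

    def __init__(self, granularity_ms=1, min_rto_ms=1000, initial_rto_ms=1000):
        self.granularity = granularity_ms   # G, the clock tick length
        self.min_rto = min_rto_ms           # lower bound on the RTO
        self.srtt = None                    # no estimate until the first sample
        self.rttvar = None
        self.rto = initial_rto_ms           # used before any RTT sample exists

    def sample(self, m):
        """Fold in one RTT measurement M and return the new RTO."""
        if self.srtt is None:
            # First measurement: srtt <- M, rttvar <- M/2
            self.srtt = m
            self.rttvar = m / 2
        else:
            err = m - self.srtt
            self.srtt += err / 8                          # gain g = 1/8
            self.rttvar += (abs(err) - self.rttvar) / 4   # gain h = 1/4
        self.rto = max(self.srtt + max(self.granularity, 4 * self.rttvar),
                       self.min_rto)
        return self.rto


est = RttEstimator()
for rtt in (100, 180, 95):
    print(rtt, est.sample(rtt))
```

With the default 1000ms lower bound, every RTO printed for this 100ms path is
1000; passing min_rto_ms=0 exposes the unclamped values.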

We now have enough detail to see how the estimators are initialized and
maintained. The procedures depend on obtaining RTT samples, which would appear to
be straightforward. We now look at why this might not always be the case.

14.3.2.3 Retransmission Ambiguity and Karn's Algorithm

A problem measuring an RTT sample can occur when a packet is retransmitted.
Say a packet is transmitted, a timeout occurs, the packet is retransmitted, and an
acknowledgment is received for it. Is the ACK for the first transmission or the
second? This is an example of the retransmission ambiguity problem. It happens because
unless the Timestamps option is being used, an ACK provides only the ACK
number with no indication of which copy (e.g., first or second) of a sequence number
is being ACKed.

The paper [KP87] specifies that when a timeout and retransmission occur, we
cannot update the RTT estimators when the acknowledgment for the
retransmitted data finally arrives. This is the "first part" of Karn's algorithm. It sidesteps
the acknowledgment ambiguity problem by excluding ambiguous samples from
the computation of the RTT estimate. It is a requirement in [RFC6298].

If we were to simply ignore retransmitted segments entirely when setting the
RTO, however, we would be failing to take into account some useful information
being provided by the network (i.e., that it is probably experiencing some form of
inability to deliver packets quickly). In such cases, it would be beneficial to reduce
the load on the network by decreasing the retransmission rate, at least until
packets are no longer being lost. This reasoning is the basis for the exponential backoff
behavior we saw in Figure 14-1.

TCP applies a backoff factor to the RTO, which doubles each time a subsequent
retransmission timer expires. Doubling continues until an acknowledgment is
received for a segment that was not retransmitted. At that time, the backoff factor
is set back to 1 (i.e., the binary exponential backoff is canceled), and the
retransmission timer returns to its normal value. Doubling the backoff factor on
subsequent retransmissions is the "second part" of Karn's algorithm. Note that when
TCP times out, it also invokes congestion control procedures that alter its sending
rate. (Congestion control is discussed in detail in Chapter 16.) Karn's algorithm,
then, really consists of two parts. As quoted directly from the 1987 paper [KP87]:

When an acknowledgement arrives for a packet that has been sent more than once 
(i.e., is retransmitted at least once), ignore any round-trip measurement based on 
this packet, thus avoiding the retransmission ambiguity problem. In addition, the 
backed-off RTO for this packet is kept for the next packet. Only when it (or a
succeeding packet) is acknowledged without an intervening retransmission will the
RTO be recalculated from SRTT. 
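
The two parts of Karn's algorithm quoted above can be sketched as a small
Python class. It is an illustration under simplifying assumptions: the class
name KarnTimer, the was_retransmitted flag, and the update_estimators callback
(which stands in for whatever routine recomputes the RTO from a fresh sample)
are hypothetical, and the case in which the Timestamps option removes the
ambiguity is not modeled.

```python
class KarnTimer:
    """Tracks the backed-off RTO according to the two parts of Karn's algorithm."""

    def __init__(self, base_rto):
        self.base_rto = base_rto   # RTO derived from the RTT estimators
        self.backoff = 1           # multiplier applied after timeouts

    @property
    def rto(self):
        return self.base_rto * self.backoff

    def on_timeout(self):
        # Second part: double the backoff factor on each retransmission timeout.
        self.backoff *= 2

    def on_ack(self, was_retransmitted, rtt_sample, update_estimators):
        """Handle an ACK; return True if the RTT sample was used."""
        if was_retransmitted:
            # First part: the sample is ambiguous, so ignore it and keep
            # the backed-off RTO for the next packet.
            return False
        # ACK for a segment sent only once: cancel the backoff and
        # recompute the RTO from the updated estimators.
        self.backoff = 1
        self.base_rto = update_estimators(rtt_sample)
        return True


timer = KarnTimer(base_rto=1.0)
timer.on_timeout()
timer.on_timeout()
print(timer.rto)   # backed-off RTO after two timeouts
```

In a fuller implementation, update_estimators would be the sample-processing
step of the connection's RTT estimator.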

This algorithm has been a required procedure in a TCP implementation for
some time (since [RFC1122]). There is an exception, however, when the TCP
Timestamps option is being used (see Chapter 13). In that case, the acknowledgment
ambiguity problem can be avoided and the first part of Karn's algorithm does not
apply.

14.3.2.4 RTT Measurement (RTTM) with the Timestamps Option

The TCP Timestamps option (TSOPT), in addition to providing a basis for the
PAWS algorithm we saw in Chapter 13, can be used for round-trip time measurement
(RTTM) [RFC1323]. The basic format of the TSOPT was described in Chapter 13. It
allows the sender to include a 32-bit number in a TCP segment that is returned in
a corresponding acknowledgment.

The timestamp value (TSV) is carried in the TSOPT of the initial SYN and
returned in the TSER part of the TSOPT in the SYN + ACK, which is how the initial
values for srtt, rttvar, and RTO are determined. Because the initial SYN "counts"
as data (i.e., it is retransmitted if lost and consumes a sequence number), its RTT is
measured. TSOPTs are also carried in other segments, so the connection's RTT can
be estimated on an ongoing basis. This seems straightforward enough but is made
more complex because TCP does not always provide an ACK for each segment it
receives. For example, TCP often provides one ACK for every other segment (see
Chapter 15) when large volumes of data are transferred. In addition, when data is
lost, reordered, or successfully retransmitted, the cumulative ACK mechanism of
TCP means that there is not necessarily any fixed correspondence between a
segment and its ACK. To handle these challenges, TCPs that use this option (most of
them today, Linux and Windows included) employ the following algorithm for
taking RTT samples:

1. The sending TCP includes a 32-bit timestamp value in the TSV portion of
the TSOPT in each TCP segment it sends. This field contains the value of
the sender's TCP "clock" when the segment is transmitted.

2. A receiving TCP keeps track of the received TSV value to send in the next
ACK it generates (in a variable typically named TsRecent) and the ACK
number in the last ACK that it sent (in a variable named LastACK). Recall that
ACK numbers represent the next in-order sequence number the receiver
(i.e., sender of the ACK) expects to see.

3. When a new segment arrives, if it contains the sequence number matching
the value in LastACK (i.e., it is the next expected segment), the segment's
TSV is saved in TsRecent.

4. Whenever the receiver sends an ACK, a TSOPT is included such that the
timestamp value contained in TsRecent is placed in the TSER part of the
TSOPT in the ACK.

5. A sender receiving an ACK that advances its window subtracts the TSER
from its current TCP clock and uses the difference as a sample value to
update its RTT estimators.
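The five steps above can be sketched as a small Python model. It is a simplification
under stated assumptions: the class and field names are invented for the example,
only in-order arrivals are handled, sequence-number and clock wraparound are ignored,
and the receiver decides for itself when to call send_ack (which is how a delayed ACK
is modeled here).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    seq: int      # first sequence number carried
    length: int   # payload bytes
    tsv: int      # sender's TCP clock at transmission (step 1)


@dataclass
class Ack:
    ack: int      # next in-order sequence number expected
    tser: int     # echoed timestamp (step 4)


class Receiver:
    def __init__(self, initial_ack: int, ts_recent: int = 0) -> None:
        self.last_ack = initial_ack        # LastACK (step 2)
        self.ts_recent = ts_recent         # TsRecent (step 2)
        self.next_expected = initial_ack   # next in-order byte

    def on_segment(self, seg: Segment) -> None:
        if seg.seq == self.last_ack:       # step 3
            self.ts_recent = seg.tsv
        if seg.seq == self.next_expected:  # in-order data only in this sketch
            self.next_expected += seg.length

    def send_ack(self) -> Ack:
        self.last_ack = self.next_expected
        return Ack(ack=self.last_ack, tser=self.ts_recent)   # step 4


class Sender:
    def __init__(self, snd_una: int) -> None:
        self.snd_una = snd_una   # oldest unacknowledged sequence number

    def on_ack(self, ack: Ack, now: int) -> Optional[int]:
        """Return an RTT sample in clock ticks, or None if no sample is taken."""
        if ack.ack <= self.snd_una:
            return None          # window not advanced: no sample (step 5)
        self.snd_una = ack.ack
        return now - ack.tser
```

If two segments arrive back-to-back and a single ACK covers both, the echoed value in
this model is the TSV of the first segment, because only that segment's sequence
number matched LastACK when it arrived.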

Timestamps are enabled by default in FreeBSD and Linux; later versions of
Windows use them in response to systems that use them. In Linux, the system
configuration variable net.ipv4.tcp_timestamps dictates whether or not they are used
(value 0 for not used, value 1 for used). In Windows, their use is controlled by the
Tcp1323Opts value in the registry area mentioned earlier. If it has the value 0,
timestamps are disabled. If its value is 2, timestamps are enabled. This key has
no default value (it is not in the registry by default). The default behavior is to use
timestamps if a peer uses them when initiating a connection.

14.3.3 The Linux Method 

The Linux RTT estimation procedure works somewhat differently from the
standard method. It uses a clock granularity of 1ms, which is finer than that of many
other implementations, along with the TSOPT. The combination of frequent
measurements of the RTT and the fine-grain clock contributes to a more accurate
estimate of the RTT but also tends to minimize the value of rttvar over time [LS00].
This happens because when a large enough number of mean deviation samples
are accumulated, they tend to cancel each other out. This is one consideration for
setting the RTO that differs somewhat from the standard method. Another relates
to the way the standard method increases rttvar when an RTT sample is
significantly below the existing RTT estimate srtt.

To understand the second issue better, recall that the RTO is usually set to the
value srtt + 4(rttvar). Consequently, any large change in rttvar causes the RTO to
increase, whether the latest RTT sample is greater or less than srtt. This is
counterintuitive: if the actual RTT has dropped significantly, it is not desirable to have
the RTO increase as a consequence. Linux deals with this issue by limiting the
impact of significant downward drops in RTT sample values on the value of rttvar.
We will now look at the details for the procedure Linux uses to set its RTO; the
procedure addresses both of the issues just discussed.

Linux keeps the variables srtt and rttvar, as with the standard method, but
also two new ones called mdev and mdev_max. The value mdev keeps the running
estimate of the mean deviation using the standard algorithm for rttvar described
before. The value mdev_max holds the maximum value of mdev seen over the last
measured RTT and is never allowed to be less than 50ms. In addition, rttvar is
regularly updated to ensure that it is at least as large as mdev_max. Consequently,
the RTO never dips below 200ms.


Note

The minimum RTO can be changed. TCP_RTO_MIN, which is a kernel
configuration constant, can be changed prior to recompiling and installing the kernel.
Some Linux versions also allow it to be changed using the ip route command.
When TCP is used in data-center networks where RTTs may be a few
microseconds, a 200ms minimum RTO can lead to severe performance degradations due to
slow TCP recovery after packet loss in local switches. This is the so-called TCP
"incast" problem. Various solutions exist to this problem, including modification of
the TCP timer granularity and minimum RTO to be on the order of microseconds
[V09]. Such small minimum RTO values are not recommended for use on the
global Internet.

Linux updates rttvar to the value of mdev_max whenever the maximum
increases. It always sets the RTO to be the sum of srtt and 4(rttvar) and ensures
that the RTO never exceeds TCP_RTO_MAX, which defaults to 120s. See [SK02]
for more details. We can see how the details of all of this work in Figure 14-2. This
figure also shows how the Timestamps option operates.

Figure 14-2 The TCP Timestamps option carries a copy of the TCP clock at the sender. ACKs return
this value to the sender, which uses the difference (current clock - returned timestamp)
to update its srtt and rttvar estimates. For clarity, only one set of timestamps is depicted.
In this Linux system, the rttvar value is constrained to be at least 50 (millisecond) units,
and the RTO has a lower bound of 200ms.

In Figure 14-2 we see a TCP connection using the Timestamps option as it
starts up. The sender is a Linux 2.6 system and the receiver is a FreeBSD 5.4 system.
Sequence numbers and timestamp values are depicted as relative values for clarity.
In addition, only the sender's timestamps are shown. The figure is not drawn exactly
to temporal scale, in order to make the numerical values easier to read. Based on the
initial RTT measurement in this example, Linux makes the following updates:

• srtt = 16ms

• mdev = (16/2)ms = 8ms

• rttvar = mdev_max = max(mdev, TCP_RTO_MIN) = max(8, 50) = 50ms

• RTO = srtt + 4(rttvar) = 16 + 4(50) = 216ms

After the initial SYN exchange, the sender supplies an ACK for the receiver's
SYN and the receiver responds with a window update. As neither of these packets
contains data (or SYN or FIN bit fields, which are counted as data), they are not
timed, and no RTT estimator update is performed when the window update arrives
back at the sender. Segments that do not contain data are not reliably delivered by
TCP, meaning they are not retransmitted if lost. These types of segments do not
require a retransmission timer to be set, because they are never retransmitted.


Note 

It is worth mentioning that TCP options, by themselves, are also not retransmitted
or reliably delivered. Only when options are specifically arranged to be present in
data segments (including SYN and FIN segments) will options be retransmitted if
lost, and then only as a side effect.


When the application performs its first write, the sending TCP emits two
segments, each with a TSV value equal to 127. The values are identical in these two
segments because the TCP clock has advanced less than 1ms (the sending TCP's
clock granularity) between the first and second transmission. It is not unusual to
see the clock fail to advance, or advance by small amounts, when the sender is
sending multiple segments "back-to-back" in this fashion.

The LastACK variable at the receiver holds the ACK number last sent by the
receiver. In this example, LastACK starts with the value 1 because the last ACK
sent was the SYN + ACK packet sent during connection establishment. When the
first full-size segment arrives, its sequence number matches the LastACK value,
so the TsRecent variable is updated to contain the value 127 from the arriving
segment's TSV. The arrival of the second segment does not update the TsRecent
variable because its Sequence Number field does not match the value in LastACK. The
ACK sent in response to the arriving packets includes the value of TsRecent in its
TSER, and its transmission also causes the receiver to update the LastACK variable
to the ACK number, 2801.

When this ACK arrives, TCP is able to make its second RTT measurement.
It takes the current TCP clock and subtracts the TSER value from the arriving
packet, forming the measurement m: m = 223 - 127 = 96. With this measurement,
the Linux TCP updates the connection variables as follows:

• mdev = mdev(3/4) + |m - srtt|(1/4) = 8(3/4) + |80|(1/4) = 26ms

• mdev_max = max(mdev_max, mdev) = max(50, 26) = 50ms

• srtt = srtt(7/8) + m(1/8) = 16(7/8) + 96(1/8) = 14 + 12 = 26ms

• rttvar = mdev_max = 50ms

• RTO = srtt + 4(rttvar) = 26 + 4(50) = 226ms
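The arithmetic for the first two measurements can be checked with a short Python
sketch. The function names and the constant MIN_RTTVAR_MS are ours, chosen for the
example; the sketch uses floating-point milliseconds and only the ordinary 3/4 and 1/4
weights for mdev, which is all these two samples exercise. It is not a rendering of the
kernel's scaled-integer code.

```python
MIN_RTTVAR_MS = 50   # assumed name for the 50ms floor on mdev_max and rttvar


def first_sample(m):
    """State after the first RTT measurement m (in ms)."""
    srtt = m
    mdev = m / 2
    mdev_max = max(mdev, MIN_RTTVAR_MS)
    rttvar = mdev_max
    return srtt, mdev, mdev_max, rttvar


def next_sample(m, srtt, mdev, mdev_max):
    """State after a later measurement m, using the ordinary gains."""
    mdev = mdev * 3 / 4 + abs(m - srtt) / 4   # deviation uses the old srtt
    mdev_max = max(mdev_max, mdev)
    srtt = srtt * 7 / 8 + m / 8
    rttvar = mdev_max
    return srtt, mdev, mdev_max, rttvar


def rto(srtt, rttvar):
    return srtt + 4 * rttvar


# Initial measurement of 16ms, then the sample m = 223 - 127 = 96ms.
srtt, mdev, mdev_max, rttvar = first_sample(16)
print(rto(srtt, rttvar))          # 216.0
srtt, mdev, mdev_max, rttvar = next_sample(96, srtt, mdev, mdev_max)
print(srtt, mdev, rto(srtt, rttvar))   # 26.0 26.0 226.0
```

Note that mdev is updated before srtt, so the deviation is taken against the old
estimate of 16ms rather than the new one.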

As mentioned previously, Linux TCP has several special modifications to the
classic RTT estimation algorithm that merit discussion. At the time the classic
algorithms were developed, the typical granularity of the TCP clock was 500ms
and the Timestamps option was not in widespread use. It was typical to take only
one RTT sample per window and update the estimators accordingly. This is still
used if timestamps are not available or not enabled.

If only one RTT sample is taken per window, the rttvar term changes relatively
slowly. With timestamps and per-packet timestamp measurements, many more
measurements can take place. Because it is common for the RTT to vary little from
one packet to the next in the same window of data, taking so many measurements
in a small period of time (e.g., when the window is large) can lead to the mean
deviation estimate being small (near zero, thanks to the law of large numbers
[F68]). To address this issue, Linux maintains the mdev variable as the running
mean deviation estimate but sets the RTO based on the rttvar, which is increased
to the maximum value of mdev during one window of data and also clamped to
be at least 50ms. The value of rttvar is allowed to decrease only one time, from one
window to the next.

The standard approach uses a heavy weight (factor of 4) given to the rttvar
term, and consequently the RTO tends to increase, even when the RTT is
decreasing. With a coarse-granularity clock (e.g., 500ms) this may have relatively little
effect because there are so few values the RTO can take on. However, with a
finer-granularity clock, such as the 1ms used by Linux, this can be of concern. To
address this issue, Linux handles the case where the RTT is decreasing by giving
less weight to the new sample if it is below the "lower end" of the estimated RTT
range (srtt - mdev). The complete relationship is as follows:

    if (m < (srtt - mdev))
        mdev = (31/32) mdev + (1/32) |srtt - m|
    else
        mdev = (3/4) mdev + (1/4) |srtt - m|

The conditional determines if the new RTT sample is below the bottom of the
range of what an RTT measurement is expected to be. If so, the new sample
indicates that the connection may be experiencing a significantly reducing RTT. To
avoid increasing mdev (and consequently rttvar and RTO) in such cases, the new
mean deviation sample, |srtt - m|, is given an 8x reduced weight versus its
normal weighting. Overall, this results in avoiding the problem of increasing the RTO
in cases where the RTT is decreasing. For an in-depth discussion of these issues,
please see [LS00] and [SK02]. In [RKS07], the authors evaluated the TCP RTT
estimation algorithms with various operating systems on 2.8 million TCP flows. They
conclude that the Linux estimator is the most effective among those studied, largely
because of its relatively quick convergence, but that it can also be tuned most
effectively by reducing the influence of RTT variance on setting the RTO.
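To see how much the reduced weight matters, the relationship can be written as a
small Python function. The function name and the sample numbers below are
illustrative assumptions, and the sketch works in floating-point milliseconds rather
than the scaled integers a kernel would use.

```python
def update_mdev(mdev: float, srtt: float, m: float) -> float:
    """Return the new mean deviation estimate after RTT sample m."""
    deviation = abs(srtt - m)
    if m < srtt - mdev:
        # Sample is below the expected range: weight it 1/32 instead of 1/4.
        return (31 / 32) * mdev + (1 / 32) * deviation
    # Ordinary case: standard 3/4, 1/4 weighting.
    return (3 / 4) * mdev + (1 / 4) * deviation


# With srtt = 200 and mdev = 20, a sample that is 150ms away from srtt has
# very different effects depending on its direction.
print(update_mdev(20, 200, 50))    # 24.0625 (sample dropped sharply)
print(update_mdev(20, 200, 350))   # 52.5    (sample rose sharply)
```

A downward move of 150ms raises mdev by about 4ms in this example, whereas an
upward move of the same size raises it by more than 32ms.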

Returning now to Figure 14-2, when ACK 7001 is generated at the receiver, we
see that its TSER contains a copy of a TSV value, not from the most recently
arriving segment, but instead from the oldest segment that has not been ACKed. When
returned to the sender, this ACK causes the RTT sample to be measured from the
first of the two segments, rather than from the last one sent. This is how the
timestamp algorithm works with delayed or otherwise erratic ACKs. When the RTT
sample from the oldest packet is measured, the RTT sample is taken to be the time
the sender should wait to expect an ACK, rather than the actual network RTT. This
is important because the sender needs to base its RTO on the rate at which it can
expect ACKs from the receiver, which may be less than the packet sending rate.

14.3.4 RTT Estimator Behaviors 

As we have seen, substantial innovation and engineering have been invested in
how to set TCP's RTO and how to estimate the RTT. Figure 14-3 shows how the
more popular estimators work, based on applying the standard and Linux
algorithms to a synthetic data set. The 1s RTO minimum recommended by [RFC6298]
has been removed for the standard method for illustration. Most real-world TCP
implementations today violate this directive anyhow [RKS07].

The graph shows a time-series plot of 200 synthetic values drawn from two
Gaussian probability distributions, N(200, 50) and N(50, 50). The first distribution
is used for the first 100 points, and the second is used for the second 100 points.
Any negative samples were made positive by sign inversion (applicable only to the
second distribution). Each plus (+) indicates a specific sample value. The
significant drop in sample values after sample 100 is apparent, and it is easy to see how
the Linux approach drops the RTO almost immediately after sample 100, while the
standard approach requires another 20 samples.
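A rough Python sketch of this kind of experiment follows. It will not reproduce the
plotted values: the random seed, the reading of the second parameter of N(200, 50) as
a standard deviation, and all function names are assumptions. The Linux-like estimator
here is also simplified, applying the 50ms floor to every sample instead of tracking
mdev_max across a window of data.

```python
import random


def synthetic_samples(seed=1):
    """200 samples: 100 from N(200, 50), then 100 from N(50, 50), made positive."""
    rng = random.Random(seed)
    first = [abs(rng.gauss(200, 50)) for _ in range(100)]
    second = [abs(rng.gauss(50, 50)) for _ in range(100)]
    return first + second


def standard_rto(samples):
    """Classic estimator (gains 1/8 and 1/4) with no minimum RTO applied."""
    srtt, rttvar = samples[0], samples[0] / 2
    out = [srtt + 4 * rttvar]
    for m in samples[1:]:
        rttvar = 0.75 * rttvar + 0.25 * abs(srtt - m)
        srtt = 0.875 * srtt + 0.125 * m
        out.append(srtt + 4 * rttvar)
    return out


def linux_like_rto(samples, floor=50.0):
    """Simplified Linux-style estimator: reduced weight on drops, 50ms floor."""
    srtt, mdev = samples[0], samples[0] / 2
    out = [srtt + 4 * max(mdev, floor)]
    for m in samples[1:]:
        err = abs(srtt - m)
        if m < srtt - mdev:
            mdev = (31 / 32) * mdev + (1 / 32) * err
        else:
            mdev = 0.75 * mdev + 0.25 * err
        srtt = 0.875 * srtt + 0.125 * m
        rttvar = max(mdev, floor)   # simplification: per sample, not per window
        out.append(srtt + 4 * rttvar)
    return out


samples = synthetic_samples()
for i in (99, 105, 120):
    print(i, round(standard_rto(samples)[i]), round(linux_like_rto(samples)[i]))
```

Printing a few RTO values on either side of sample 100 gives a feel for how each
estimator reacts to the drop; the exact numbers depend on the seed.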

If we focus now on the Linux rttvar line, we can see that it remains relatively
constant. This is because of the 50ms minimum on the mdev_max value (and
consequently the rttvar value). This has the effect of making the Linux RTO value
always at least 200ms, and all unnecessary retransmissions are avoided (although
the timer may not fire as quickly, leading to reduced performance when packets
are lost). The standard approach runs into potential problems at samples 78 and
191, where a spurious retransmission could take place. We shall discuss this problem
later.

Figure 14-3 The Linux and standard RTO assignment and RTT estimation algorithms applied to
synthetic (pseudorandom) sample points. The first 100 points are drawn from an N(200,
50) distribution, and the second 100 are drawn from an N(50, 50) distribution with
negative values turned positive. Linux avoids the increase in RTO when the mean drops
after sample 100. With Linux, the minimum RTO is effectively set to 200ms, so after
sample 120, the standard method is tighter. Linux avoids setting the RTO too low in all
cases for this example. The standard approach runs into potential problems at samples
78 and 191.

14.3.5 RTTM Robustness to Loss and Reordering 

The TSOPT has been shown to work properly when packets are not lost, whether 
or not the receiver delays some ACKs. The algorithm also operates correctly in the 
following cases: 

• Out-of-order segments: When a receiver receives an out-of-order segment,
typically because of the loss of a previous segment, an ACK is supposed
to be generated immediately to help the fast retransmit algorithm (see
Section 14.5) operate. This ACK includes as its TSER value the TSV value
from the most recent in-order segment that arrived at the receiver (i.e., the
most recent one to advance the window, which is generally not the arriving
out-of-order segment). This tends to cause the sender's RTT sample values
to increase, leading to a corresponding increase in the sender's RTO. When
packets are being reordered, this is beneficial because it tends to allow the
sender a bit more time to realize that packets are reordered rather than lost
before initiating a retransmission.

• Successful retransmissions: When a receiver receives a segment that fills a
hole in its receive buffer (e.g., because of the successful arrival of a
retransmission), the window generally jumps forward. In this case, the value
carried in the TSER of the corresponding ACK is from the most recently
arriving segment. This is useful because if an older segment's TSV were
used, it might be more than one RTO's worth of time old, leading to a large
unwanted bias in the sender's RTT estimate.

The example in Figure 14-4 illustrates these points. Assume that three
segments, each containing 1024 bytes, are received in the following order: segment 1
with bytes 1-1024, segment 3 with bytes 2049-3072, and then segment 2 with bytes
1025-2048.

[Figure 14-4 labels: LastACK = 1025, 3073; TsRecent = 1, 2]

Figure 14-4 When segments are reordered, the returned timestamp is that of the last segment to
advance the receiver's window (not the largest timestamp to arrive at the receiver).
This biases the sender's RTO toward overestimating the RTT during periods of packet
reordering and reduces its aggressiveness.


The ACKs sent back in Figure 14-4 are ACK 1025 with the timestamp from
segment 1 (a normal ACK for data that was expected), ACK 1025 with the
timestamp from segment 1 (a duplicate ACK in response to the in-window but
out-of-sequence segment), then ACK 3073 with the timestamp from segment 2 (not the
timestamp from segment 3). This has the effect of overestimating the RTT when
segments are reordered (or lost). A larger RTT estimate leads to a larger RTO,
making the sender less aggressive to retransmit. This is especially desirable in cases
where packet reordering occurs, because aggressive retransmissions are likely to
be spurious.
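The sequence of ACKs just described can be reproduced with a small Python model of
the receiver. The class name, the dictionary used to hold out-of-order data, and the
TSV values 10, 11, and 12 are assumptions made for the example; the model ACKs every
arriving segment immediately and ignores sequence-number wraparound.

```python
class ReorderingReceiver:
    """Toy receiver that tracks LastACK and TsRecent across reordering."""

    def __init__(self, first_seq=1):
        self.last_ack = first_seq    # LastACK
        self.ts_recent = None        # TsRecent
        self.rcv_next = first_seq    # next in-order byte expected
        self.held = {}               # out-of-order segments: seq -> length

    def receive(self, seq, length, tsv):
        """Process one segment and return (ACK number, TSER) for the reply."""
        if seq == self.last_ack:
            self.ts_recent = tsv     # only the next expected segment updates it
        if seq == self.rcv_next:
            self.rcv_next += length
            # Deliver any held segments that are now contiguous.
            while self.rcv_next in self.held:
                self.rcv_next += self.held.pop(self.rcv_next)
        elif seq > self.rcv_next:
            self.held[seq] = length  # creates a hole at rcv_next
        self.last_ack = self.rcv_next
        return self.last_ack, self.ts_recent


rx = ReorderingReceiver()
print(rx.receive(1, 1024, tsv=10))      # (1025, 10)  segment 1
print(rx.receive(2049, 1024, tsv=12))   # (1025, 10)  segment 3, duplicate ACK
print(rx.receive(1025, 1024, tsv=11))   # (3073, 11)  segment 2 fills the hole
```

The final ACK echoes the timestamp of segment 2, the segment that advanced the
window, even though segment 3 carried a later timestamp.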

So, we have seen that the Timestamps option allows the sender to make
estimates of the RTT even when there are packet delays, losses, and reorderings. The
sender can measure the RTT using whatever values it wishes to in the option, but
these units must at least be proportional to real time and of a reasonable
granularity to be compatible with TCP sequence numbers and plausible link rates (see
[RFC1323] for more details on this). In particular, to be useful to the sender, the
TCP clock must "tick" at least once for any plausible RTT. On the other hand, it
should not change faster than once every 59ns. If it did, the 32-bit TSV value
holding the TCP clock value could wrap around within the maximum time permitted
by the IP layer for a single packet to exist (255s) [ID1323b]. Assuming all this to be
correct, the RTO value can now be used to trigger retransmissions.
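The 59ns figure follows from dividing the 255s packet lifetime by the number of
distinct 32-bit clock values, which a few lines of Python can confirm. The function
names are ours, and the 1ms tick in the second calculation is simply the Linux clock
granularity mentioned earlier.

```python
def min_tick_ns(bits=32, lifetime_s=255):
    """Shortest tick period (ns) for which the clock wraps no sooner than lifetime_s."""
    return lifetime_s / 2**bits * 1e9


def wrap_seconds(tick_s, bits=32):
    """Time for a counter of the given width to wrap at one tick per tick_s seconds."""
    return 2**bits * tick_s


print(f"{min_tick_ns():.1f} ns")                      # 59.4 ns
print(f"{wrap_seconds(0.001) / 86400:.1f} days")      # 49.7 days
```

With a 1ms tick the timestamp clock takes roughly seven weeks to wrap, far longer
than any packet can survive in the network.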


14.4 Timer-Based Retransmission 

Once a sending TCP has established its RTO based upon measurements of the
time-varying values of effective RTT, whenever it sends a segment it ensures that
a retransmission timer is set appropriately. When setting a retransmission timer,
the sequence number of the so-called timed segment is recorded, and if an ACK
is received in time, the retransmission timer is canceled. The next time the sender
emits a packet with data in it, a new retransmission timer is set, the old one is
canceled, and the new sequence number is recorded. The sending TCP therefore
continuously sets and cancels one retransmission timer per connection; if no data
is ever lost, no retransmission timer ever expires.


Note 

This observation proved somewhat of a surprise to the designers of the host 
operating systems. In a typical operating system, timers are used to signal a wide 
variety of events, and the implementation of the timer facility is tuned to efficiently 
set up and expire timers (which invoke system functions). For TCP, however, the 
requirement is for efficient setting and resetting or canceling of timers; if TCP is 
working well, timers never expire. 


When TCP fails to receive an ACK for a segment it has timed on a
connection within the RTO, it performs a timer-based retransmission. We have seen this
already in Figure 14-1. TCP considers a timer-based retransmission as a fairly
major event; it reacts very cautiously when it happens by quickly reducing the rate
at which it sends data into the network. It does this in two ways. The first way is
to reduce its sending window size based on congestion control procedures (see
Chapter 16). The other way is to keep increasing a multiplicative backoff factor
applied to the RTO each time a retransmitted segment is again retransmitted. This
is implemented in the "second part" of Karn's algorithm mentioned previously.
In particular, the RTO value is (temporarily) multiplied by the value γ to form the
backed-off timeout when multiple retransmissions of the same segment occur:

RTO = γRTO

In ordinary circumstances, γ has the value 1. On subsequent retransmissions,
γ is doubled: 2, 4, 8, and so forth. There is typically a maximum backoff factor that γ
is not allowed to exceed (Linux ensures that the used RTO never exceeds the value
TCP_RTO_MAX, which defaults to 120s). Once an acceptable ACK is received, γ is
reset to 1.
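As a quick illustration of how the backed-off timeout grows, the following Python
sketch lists the timeouts used for successive retransmissions of one segment. The
function name and the 1s base RTO are assumptions for the example; the 120s ceiling is
the TCP_RTO_MAX default mentioned above, applied here to the resulting timeout.

```python
TCP_RTO_MAX_S = 120.0   # ceiling on the timeout actually used


def backed_off_rto(base_rto_s, retransmissions, rto_max_s=TCP_RTO_MAX_S):
    """Timeout to use after the given number of retransmissions of a segment."""
    gamma = 2 ** retransmissions          # 1, 2, 4, 8, ...
    return min(gamma * base_rto_s, rto_max_s)


print([backed_off_rto(1.0, n) for n in range(9)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 120.0, 120.0]
```

Once an acceptable ACK arrives, the count of retransmissions returns to zero and the
timeout falls back to the base RTO.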

14.4.1 Example 

We can see the action of the retransmission timer by creating a connection similar
to the one we looked at in Figures 14-1 and 14-2, but where we purposely drop the
segment with sequence number 1401 twice (see Figure 14-5).

For this example, we send the TCP segments through a special function that
is able to drop them a certain number of times based on their TCP sequence
numbers. This adds a bit of extra delay to the RTT as compared with Figure 14-2. The
connection starts out as before, except when the pair of segments with sequence
numbers 1 and 1401 is sent, the second packet is dropped. Presumably the first of
these segments reaches the receiver, but the receiver is delaying ACKs and does not
respond immediately. Lacking a response in 219ms, the sender's retransmission
timer expires, causing the packet with sequence number 1 to be resent (this time
with TSV value 577). Its arrival elicits an ACK from the receiver, which returns to
the sender. Because this ACK acknowledges data and moves the sender's window
forward, its TSER value is used to update the srtt and RTO values to 34 and 234,
respectively.

The next three ACKs are generated in response to packets that arrive at the
receiver. The ACKs with the asterisks (*) are all duplicate ACKs and contain SACK
information. We will discuss the effect of duplicate ACKs and SACKs in Sections
14.5 and 14.6. For now, because these ACKs do not move the sender's window
forward, their TSER values are not used.

With the eventual retransmission and arrival of segment 1401 (at TCP clock
time 911) at the receiver, the repair period is complete, and the receiver responds
with ACK number 7001, indicating that all data has been received.

The retransmission timer provides a form of "last-resort restart" for a TCP
connection that has ceased to move data through the network regularly. In most
cases it is unnecessary (and undesirable) to have retransmission timers trigger
retransmissions because the RTO is generally established to be larger than the
typical RTT (by about a factor of 2 or more), so a timer-based retransmission often
leads to underutilization of the network capacity. Fortunately, TCP has another
method for detecting and repairing lost packets, which is almost always more
efficient than timer-based retransmissions. It is called fast retransmit because it does
not require the expiration of a retransmission timer to be invoked.

Figure 14-5 Segment 1401 is forcibly dropped twice. This results in a timer-based retransmission at the sender.
The srtt, rttvar, and RTO values are updated only by a returning ACK that advances the sender's
window. ACKs with asterisks (*) include SACK information.


14.5 Fast Retransmit 

Fast retransmit [RFC5681] is a TCP procedure that can induce a packet
retransmission based on feedback from the receiver instead of requiring a retransmission
timer to expire. As a result, packet loss can often be more quickly and efficiently
repaired using fast retransmit than with timer-based retransmission. A typical
TCP implements both fast retransmit and timer-based retransmission. Before
we describe fast retransmit in more detail, it is important to realize that TCP is
required to generate an immediate acknowledgment (a "duplicate ACK") when
an out-of-order segment is received, and that the loss of a segment implies
out-of-order arrivals at the receiver when subsequent data arrives. When this happens, a
hole is created at the receiver. The sender's job then becomes filling the receiver's
holes as quickly and efficiently as possible.

The duplicate ACKs sent immediately when out-of-order data arrives are not
delayed. The reason is to let the sender know that a segment was received out of
order, and to indicate what sequence number is expected (i.e., where the hole is).
When SACK is used, these duplicate ACKs typically contain SACK blocks as well,
which can provide information about more than one hole.

A duplicate ACK (with or without SACK blocks) arriving at a sender is a
potential indicator that a packet sent earlier has been lost. As we discuss in
Section 14.8 in more detail, duplicate ACKs can also appear when there is packet
reordering in the network: if a receiver receives a packet for a sequence number
beyond the one it is expecting next, the expected packet could be either missing
or merely delayed. Because we generally do not know which one, TCP waits for a
small number of duplicate ACKs (called the duplicate ACK threshold or dupthresh)
to be received before concluding that a packet has been lost and initiating a fast
retransmit. Traditionally, dupthresh has been a constant (with value 3), but some
nonstandard implementations (including Linux) alter this value based on the
current measured level of reordering (see Section 14.8).

A TCP sender observing at least dupthresh duplicate ACKs retransmits one
or more packets that appear to be missing without waiting for a retransmission
timer to expire. It may also send additional data that has not yet been sent. This is
the essence of the fast retransmit algorithm. Packet loss inferred by the presence
of duplicate ACKs is assumed to be related to network congestion, and congestion
control procedures (discussed in Chapter 16) are invoked along with fast
retransmit. Without SACK, no more than one segment is typically retransmitted until
an acceptable ACK is received. With SACK, ACKs contain additional information
allowing the sender to fill more than one hole in the receiver per RTT. We explore
the use of SACK with fast retransmit after illustrating an example of the basic fast
retransmit algorithm.
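The duplicate-ACK counting at the heart of the basic algorithm can be sketched in
Python as follows. The class and field names are assumptions, dupthresh is fixed at 3,
and the test for what counts as a duplicate (same ACK number, no payload, unchanged
advertised window) follows the definition in [RFC5681]. Congestion control, SACK,
and sequence-number wraparound are all left out.

```python
from dataclasses import dataclass

DUPTHRESH = 3   # traditional duplicate ACK threshold


@dataclass
class FastRetransmitDetector:
    """Counts duplicate ACKs for the oldest unacknowledged sequence number."""
    snd_una: int        # highest ACK number received so far
    last_window: int    # window advertised with the most recent ACK
    dup_acks: int = 0

    def on_ack(self, ack: int, window: int, payload_len: int = 0) -> bool:
        """Return True when this ACK should trigger a fast retransmit of snd_una."""
        if ack > self.snd_una:
            # New data acknowledged: advance and reset the duplicate count.
            self.snd_una, self.last_window, self.dup_acks = ack, window, 0
            return False
        is_dup = (ack == self.snd_una and payload_len == 0
                  and window == self.last_window)
        if not is_dup:
            # For example, a pure window update: remember the window, do not count.
            self.last_window = window
            return False
        self.dup_acks += 1
        return self.dup_acks == DUPTHRESH
```

In this sketch the method returns True exactly once, on the third duplicate; later
duplicates for the same hole do not trigger another retransmission.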

14.5.1 Example 

In the following example, we create a TCP connection similar to the one from
Figure 14-4, except this time we drop segments 23801 and 26601 and SACK is
disabled. We will see how TCP uses the basic fast retransmit algorithm to repair these
holes. The sender is a Linux 2.6 system and the receiver is a FreeBSD 5.4 system.
The plot in Figure 14-6 from Wireshark's Statistics | TCP Stream Graph |
Time-Sequence Graph (tcptrace) screen shows fast retransmit in action.



Figure 14-6 In this plot, TCP sequence numbers are on the y-axis and time is on the x-axis.
Outgoing segments are displayed as darker line segments, and the incoming ACK numbers
appear as lighter gray segments. Fast retransmit is triggered by the arrival of the third
duplicate ACK at time 0.993s. This connection does not use SACK, so it is able to repair
at most only one hole per RTT. Additional duplicate ACKs arriving after the third cause
the sender to send new segments (not retransmissions). A "partial ACK" arriving at
time 1.32 causes the next retransmission.

This plot indicates the relative sending sequence number on the y-axis and
the elapsed time on the x-axis. The black vertical I-shaped extents indicate the
span of sequence numbers present in the transmitted segment. The blue lines in
Wireshark (lower light gray line in Figure 14-6) indicate ACK numbers in
returning packets. At approximately time 1.0, sequence number 23801 is retransmitted
because of the fast retransmit algorithm (the initial transmission is not visible
because it was dropped by the process at the sender below the TCP protocol layer).
The retransmission is triggered by the arrival of the third duplicate ACK, as
illustrated by the repeated lower line segments. The retransmit can also be seen using
the basic analysis screen of Wireshark (see Figure 14-7).


[Wireshark packet list for the trace (columns: No., Time, Source, Destination, Protocol, Info)]

Figure 14-7 The TCP exchange showing relative sequence numbers. Packets 50 and 66 are retransmissions.
Packet 50 is retransmitted because of the fast retransmit algorithm, which triggers as a result of
three duplicate ACKs. No retransmission timer is required, so recovery is relatively quick.


The first line of Figure 14-7 (number 40) indicates the first time ACK 23801 is
received. Wireshark highlights (in red, appearing as black in Figure 14-7) other
"interesting" TCP packets. Such packets differ from what would be expected for
a TCP transfer with no losses or other anomalies. We see window updates, duplicate
ACKs, and retransmissions. The window update at time 0.853 is an ACK with
a duplicate sequence number (because no data is being carried) but contains a
change to the TCP flow control window. The window changes from 231,616 bytes
to 233,016 bytes. Thus, it is not counted toward the three-duplicate-ACK threshold
required to initiate a fast retransmit. Window updates merely provide a copy of
the window advertisement. We will look at these in more detail in Chapter 15.

The packets arriving at times 0.890, 0.926, and 0.964 are all duplicate ACKs for
sequence number 23801. The arrival of the third of these duplicate ACKs triggers
the fast retransmit of segment 23801 at time 0.993. This can also be seen using
Wireshark's Statistics | Flow Graph feature (see Figure 14-8).
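
The duplicate-ACK bookkeeping just described can be sketched in a few lines of Python.
This is a minimal illustration, not the logic of any particular TCP: the Ack record, the
fast_retransmit_time function, and the DUPTHRESH name are our own, the arrival time given
for the first ACK is a placeholder, and we assume that data is outstanding and that the
advertised window stays at 233,016 bytes for the three duplicate ACKs.

```python
from dataclasses import dataclass

DUPTHRESH = 3  # duplicate ACKs needed to trigger fast retransmit


@dataclass
class Ack:
    time: float      # arrival time at the sender, in seconds
    ack: int         # cumulative ACK number
    win: int         # advertised window, in bytes
    length: int = 0  # payload bytes carried by the ACK segment


def fast_retransmit_time(acks):
    """Return the arrival time of the ACK that triggers fast retransmit, or None."""
    last_ack = last_win = None
    dup_count = 0
    for a in acks:
        # A duplicate ACK carries no data and repeats both the ACK number and the window.
        if a.length == 0 and a.ack == last_ack and a.win == last_win:
            dup_count += 1
            if dup_count == DUPTHRESH:
                return a.time
        elif a.ack != last_ack:
            dup_count = 0          # the ACK number advanced: start counting afresh
        # A window update (same ACK number, new window) is neither counted nor a reset.
        last_ack, last_win = a.ack, a.win
    return None


trace = [
    Ack(0.800, 23801, 231616),   # first ACK for 23801 (time is a placeholder)
    Ack(0.853, 23801, 233016),   # window update, not a duplicate ACK
    Ack(0.890, 23801, 233016),   # duplicate ACK 1
    Ack(0.926, 23801, 233016),   # duplicate ACK 2
    Ack(0.964, 23801, 233016),   # duplicate ACK 3: fast retransmit
]
print(fast_retransmit_time(trace))
```

Under these assumptions the sketch selects the ACK arriving at time 0.964 as the trigger,
which is consistent with the retransmission observed shortly afterward at time 0.993.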



Figure 14-8 The retransmission at time 0.993 is triggered by the fast retransmit algorithm after 
receiving duplicate ACKs at times 0.890, 0.926, and 0.964. The ACK at time 0.853 is not 
considered a duplicate ACK because it contains a window update.

Here we see, in a slightly different way, the same fast retransmit at time 0.993.
We can also see the second retransmission that takes place at time 1.326. This
second retransmission takes place because of the arrival of the ACK at time 1.322.

The second retransmission is somewhat different from the first. When the
first retransmission takes place, the sending TCP notes the highest sequence number
it had sent just before it performed the retransmission (43401 + 1400 = 44801).
This is called the recovery point. TCP is considered to be recovering from loss after
a retransmission until it receives an ACK that matches or exceeds the sequence
number of the recovery point. In this example, the ACKs at times 1.321 and 1.322
are not for 44801, but instead for 26601. This number is larger than the previous
highest ACK value seen (23801), but not enough to meet or exceed the recovery
point (44801). This type of ACK is called a partial ACK for this reason. When partial
ACKs arrive, the sending TCP immediately sends the segment that appears to
be missing (26601 in this case) and continues this way until the recovery point is
matched or exceeded by an arriving ACK. If permitted by congestion control
procedures (see Chapter 16), it may also send new data it has not yet sent.
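
To make the role of the recovery point concrete, here is a small Python sketch of the
partial-ACK rule. It is an illustration under simplifying assumptions, not a complete
NewReno implementation: the class and method names are our own, congestion control is
ignored, and the final ACK value (50401) is taken from the last ACK visible in Figure 14-7.

```python
class NewRenoRecovery:
    def __init__(self, snd_una):
        self.snd_una = snd_una   # highest cumulative ACK seen so far
        self.recover = None      # recovery point; None when not in recovery

    def enter_recovery(self, highest_sent_next):
        """Called when fast retransmit fires; returns the segment to retransmit."""
        self.recover = highest_sent_next
        return self.snd_una

    def on_ack(self, ack):
        """Return a sequence number to retransmit, or None if nothing is needed."""
        if ack <= self.snd_una:
            return None          # duplicate or old ACK: nothing new is learned
        self.snd_una = ack
        if self.recover is None:
            return None          # ordinary ACK outside of recovery
        if ack >= self.recover:
            self.recover = None  # recovery point matched or exceeded: recovery ends
            return None
        return ack               # partial ACK: the next hole starts at the ACK number


sender = NewRenoRecovery(snd_una=23801)
print(sender.enter_recovery(43401 + 1400))  # first retransmission: 23801
print(sender.on_ack(26601))                 # partial ACK: retransmit 26601
print(sender.on_ack(50401))                 # beyond 44801: recovery complete
```

A plain Reno sender would differ only in the last branch: any acceptable ACK would end
recovery, and the second hole would have to be discovered by other means.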

This example illustrates the behavior of a TCP not using SACKs, when using
fast retransmit, and when performing additional retransmits during recovery
based on the "NewReno" sending algorithm [RFC3782]. Because no SACKs are
being used, the sender can learn of at most one receiver hole per round-trip time,
indicated by the increase in the ACK number of returning packets, which can only
occur once a retransmission filling the receiver's lowest-numbered hole has been
received and ACKed.

The precise behavior during recovery varies, depending on the type and
configuration of the TCP sender and receiver. This example illustrates a non-SACK
sender using the NewReno algorithm, a fairly common arrangement. With
NewReno, partial ACKs keep the sender in recovery as described. With older TCP
variants (plain Reno), there is no such concept, and any acceptable ACK brings the
TCP out of recovery. Doing so can present some performance problems for TCP,
and these are discussed in detail in Chapter 16. NewReno and SACK, which we
discuss next, are sometimes called "advanced loss recovery" techniques to distinguish
them from the older approaches.


14.6 Retransmission with Selective Acknowledgments

With the standardization of the Selective Acknowledgment options in [RFC2018],
a SACK-capable TCP receiver is able to describe data it has received with sequence
numbers beyond the cumulative ACK Number field it sends in the primary portion
of the TCP header. As we mentioned before, gaps between the ACK number and
other in-window data cached at the receiver are called holes. Data with sequence
numbers beyond the holes are called out-of-sequence data because that data is not
contiguous, in terms of its sequence numbers, with the other data the receiver has
already received.

The job of a sending TCP is to fill the holes in the receiver by retransmitting
any data the receiver is missing, yet to be as efficient as possible by not resending
data the receiver already has. In many circumstances, the properly operating
SACK sender is able to fill these holes more quickly and with fewer unnecessary
retransmissions than a comparable non-SACK sender because it does not have to
wait an entire RTT to learn about additional holes. When the SACK option is being
used, an ACK can be augmented with up to three or four SACK blocks that contain
information about out-of-sequence data at the receiver. Each SACK block contains
two 32-bit sequence numbers representing the first and last sequence numbers
(plus 1) of a continuous block of out-of-sequence data being held at the receiver.

A SACK option that specifies n blocks has a length of 8n + 2 bytes, so the
40 bytes available to hold TCP options can specify a maximum of four blocks. It
is expected that SACK will often be used in conjunction with the TSOPT, which
takes an additional 10 bytes (plus 2 bytes of padding), meaning that SACK is
typically able to include only three blocks per ACK.
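
The arithmetic behind the four-block and three-block limits is easy to check. The
following Python fragment is only a worked calculation; the constant and function names
are our own.

```python
TCP_MAX_OPTION_BYTES = 40   # space available for options in a TCP header
SACK_BLOCK_BYTES = 8        # two 32-bit sequence numbers per block
SACK_HEADER_BYTES = 2       # option kind and option length


def sack_option_length(n_blocks):
    """Length in bytes of a SACK option carrying n_blocks blocks (8n + 2)."""
    return SACK_BLOCK_BYTES * n_blocks + SACK_HEADER_BYTES


def max_sack_blocks(other_option_bytes=0):
    """Largest number of SACK blocks that fit beside the other options."""
    free = TCP_MAX_OPTION_BYTES - other_option_bytes
    return max(0, (free - SACK_HEADER_BYTES) // SACK_BLOCK_BYTES)


print(sack_option_length(4))   # 34 bytes: fits within 40
print(max_sack_blocks())       # 4 blocks when SACK is the only option
print(max_sack_blocks(12))     # 3 blocks beside a padded 12-byte TSOPT
```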

With three distinct blocks, up to three holes can be reported to the sender. If 
not limited by congestion control (see Chapter 16), all three could be filled within 
one round-trip time using a SACK-capable sender. An ACK packet containing one 
or more SACK blocks is sometimes called simply a "SACK." 

14.6.1 SACK Receiver Behavior 

A SACK-capable receiver is allowed to generate SACKs if it has received the 
SACK-Permitted option during the TCP connection establishment (see Chapter 
13). Generally speaking, a receiver generates SACKs whenever there is any out-of- 
order data in its buffer. This can happen either because data was lost in transit, or 
because it has been reordered and newer data has arrived at the receiver before 
older data. We consider the first case here and discuss the second one later. 

The receiver places in the first SACK block the sequence number range
contained in the segment it has most recently received. Because the space in a SACK
option is limited, it is best to ensure that the most recent information is always 
provided to the sending TCP, if possible. Other SACK blocks are listed in the order 
in which they appeared as first blocks in previous SACK options. That is, they are 
filled in by repeating the most recently sent SACK blocks (in other segments) that 
are not subsets of another block about to be placed in the option being constructed. 
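
One way to picture these rules is the following Python sketch of a receiver that builds
the SACK blocks for each ACK it sends. It is a simplified model under our own assumptions:
the SackReceiver class and merge helper are hypothetical, buffered ranges are kept merged
(so a reported block is never a subset of another), and a limit of three blocks is assumed.

```python
def merge(ranges):
    """Merge overlapping or adjacent (left, right) ranges; right is last byte + 1."""
    merged = []
    for left, right in sorted(ranges):
        if merged and left <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged


class SackReceiver:
    def __init__(self, rcv_nxt, max_blocks=3):
        self.rcv_nxt = rcv_nxt        # next in-order sequence number expected
        self.held = []                # out-of-order ranges currently buffered
        self.recent = []              # one sequence number inside each recently reported block
        self.max_blocks = max_blocks

    def on_segment(self, seq, length):
        """Accept one segment; return (ack_number, sack_blocks) for the ACK sent in reply."""
        self.held = merge(self.held + [(seq, seq + length)])
        if self.held[0][0] <= self.rcv_nxt:       # lowest range fills the first hole
            self.rcv_nxt = max(self.rcv_nxt, self.held.pop(0)[1])
        blocks = []
        for s in [seq] + self.recent:             # newest information first
            blk = next((b for b in self.held if b[0] <= s < b[1]), None)
            if blk is not None and blk not in blocks:
                blocks.append(blk)
        self.recent = [b[0] for b in blocks]
        return self.rcv_nxt, blocks[:self.max_blocks]


rx = SackReceiver(rcv_nxt=23801)
print(rx.on_segment(25201, 1400))   # first out-of-order arrival
print(rx.on_segment(28001, 1400))   # newest block first, older block repeated
print(rx.on_segment(23801, 1400))   # hole filled: cumulative ACK advances
```

The sequence numbers used here mirror the example in Section 14.6.3, where a SACK block
for 25201 is later repeated behind a newer block for 28001.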

The purpose of including more than one SACK block in a SACK option and
repeating these blocks across multiple SACKs is to provide some redundancy in
the case where SACKs are lost. If SACKs were never lost, [RFC2018] points out that
only one SACK block would be required per SACK for full SACK functionality.
Unfortunately, SACKs and regular ACKs are sometimes lost and are not retransmitted
by TCP unless they contain data (or the SYN or FIN control bit fields are
turned on).

14.6.2 SACK Sender Behavior

Although it is necessary for a SACK-capable receiver to generate proper SACK
information to make full use of SACK, it is not sufficient for a TCP connection to
benefit from SACKs. A SACK-capable sender must be used that treats the SACK
blocks appropriately and performs selective retransmission by sending only those
segments missing at the receiver, a process also called selective repeat. The SACK
sender keeps track of any cumulative ACK information it receives (like any TCP
sender), plus any SACK information it receives. It uses the SACK information it
receives in ACKs generated at the receiver to avoid retransmitting data the receiver
reports that it already has. One way it can do this is to keep a "SACKed" indication
for each segment in its retransmission buffer that is set whenever a corresponding
range of sequence numbers arrives in a SACK.

When a SACK-capable sender has the opportunity to perform a retransmission,
usually because it has received a SACK or seen multiple duplicate ACKs, it
has the choice of whether it sends new data or retransmits old data. The SACK
information provides the sequence number ranges present at the receiver, so the
sender can infer what segments likely need to be retransmitted to fill the receiver's
holes. The simplest approach is to have the sender first fill the holes at the receiver
and then move on to send more new data [RFC3517] if the congestion control
procedures allow. This is the most common approach.

There is one exception to this behavior. In [RFC2018], the current specification
for SACK options, SACK blocks are considered advisory. This means that a receiver
could provide a SACK to the sender indicating that some sequence numbers have
been received successfully and then change its mind later ("renege"). Because of
this, the SACK sender is not able to free its retransmission buffer of data it has
received only a SACK for; it is permitted to free a block of data only once the regular
TCP ACK number of the receiver has passed by the highest sequence number of
this data. The rule also affects what TCP is supposed to do when a retransmission
timer expires. When a sending TCP initiates a timer-based retransmission, any
information regarding out-of-sequence data at the receiver derived from SACKs is
supposed to be forgotten. If out-of-sequence data remains at the receiver, the ACK
for the retransmitted segment contains additional SACK blocks the sender can
then use. Fortunately, reneging is rare and discouraged.
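
The sender-side rules in this section (a "SACKed" indication per buffered segment, filling
holes first, freeing data only on a cumulative ACK, and forgetting SACK information on a
timeout) can be collected into a small Python sketch. The SackScoreboard class is our own
illustration and assumes fixed-size segments and no congestion control; real
implementations track considerably more state.

```python
class SackScoreboard:
    def __init__(self, mss):
        self.mss = mss
        self.segments = {}   # starting sequence number -> SACKed flag

    def sent(self, seq):
        """Record a transmitted segment in the retransmission buffer."""
        self.segments.setdefault(seq, False)

    def on_ack(self, ack, sack_blocks=()):
        # Only the cumulative ACK number allows buffered data to be freed.
        for seq in [s for s in self.segments if s + self.mss <= ack]:
            del self.segments[seq]
        # SACK blocks are advisory: mark the segments but keep them buffered.
        for left, right in sack_blocks:
            for seq in self.segments:
                if left <= seq and seq + self.mss <= right:
                    self.segments[seq] = True

    def next_retransmit(self):
        """Lowest unSACKed segment lying below SACKed data (a hole), or None."""
        sacked = [s for s, flag in self.segments.items() if flag]
        if not sacked:
            return None
        holes = [s for s, flag in self.segments.items() if not flag and s < max(sacked)]
        return min(holes) if holes else None

    def on_timeout(self):
        # The receiver may have reneged, so forget everything learned from SACKs.
        for seq in self.segments:
            self.segments[seq] = False


sb = SackScoreboard(mss=1400)
for seq in (23801, 25201, 26601, 28001, 29401):
    sb.sent(seq)
sb.on_ack(23801, [(28001, 29401), (25201, 26601)])
print(sb.next_retransmit())   # first hole: 23801
sb.on_ack(26601, [(28001, 29401)])
print(sb.next_retransmit())   # next hole: 26601
```

When next_retransmit returns None, there is no known hole and the sender is free to move
on to new data, subject to congestion control.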

14.6.3 Example 

To understand how the use of SACK alters the sender and receiver behaviors, we
repeat the preceding fast retransmit experiment with the same setup (dropping
sequence numbers 23801 and 26601), but this time the sender and receiver are
using SACK. To get an immediate idea of what happens, we again use Wireshark's
TCP sequence number (tcptrace) plot function (see Figure 14-9).

[Time/Sequence Graph (tcptrace). Annotations: "First Retransmission Triggered by
First Duplicate ACK" and "Second Retransmission Sent During Same RTT".]

Figure 14-9 Fast retransmit is triggered by the arrival of the first duplicate ACK containing SACK
information. The arrival of the next ACK allows the sender to learn of the second missing
segment and retransmit it within the same RTT.


Figure 14-9 is similar to Figure 14-6, but the SACK sender has not had to wait
an RTT to retransmit lost segment 26601 after retransmitting segment 23801. This
is a result of the SACK information contained in the arriving ACKs. We will look
at those in detail later, but first we verify the negotiation of the SACK-Permitted
option during connection setup. This can be seen in Figure 14-10.

As expected, the receiver indicates its ability to use SACKs with the
SACK-Permitted option. The SYN packet from the sender, the first packet of the trace,
also contains an identical option. These options are present only at connection
setup, and thus they only ever appear in segments with the SYN bit field set.

Once the connection is permitted to use SACKs, packet loss generally causes
the receiver to start producing SACKs. For example, Wireshark shows the contents
of the SACK options when the first SACK is selected (see Figure 14-11).

[Wireshark screenshot of the SACK trace. Packet 1 is the SYN from 70.231.128.151 port 1043
to port 6666, carrying MSS=1412 and SACK_PERM=1. Packet 2, the SYN+ACK from
169.229.62.97, is expanded: header length 44 bytes, window size 1460, and 24 bytes of
options consisting of Maximum segment size 1460 bytes, Window scale 2 (multiply by 4),
Timestamps (TSval 488819609, TSecr 297077721), and TCP SACK Permitted Option: True.]


Figure 14-10 The SACK-Permitted option is exchanged in SYN segments to indicate the capability to
generate and process SACK information. Most modern TCPs support the MSS, Timestamps,
Window Scale, and SACK-Permitted options during connection establishment.


Figure 14-11 shows the series of events after the first SACK is received. Wireshark
indicates SACK information by indicating the left edge and right edge of
the SACK range. Here we see that the ACK for 23801 contains a SACK block of
[25201,26601], indicating a hole at the receiver. The receiver is missing the sequence
number range [23801,25200], which corresponds to the single 1400-byte packet
starting with sequence number 23801. Note that this SACK is a window update
and is not counted as a duplicate ACK for the reasons discussed earlier. It does not
trigger fast retransmit.

The SACK arriving at time 0.967 contains two SACK blocks: [28001,29401] and
[25201,26601]. Recall that the first SACK blocks from previous SACKs are repeated
in later positions in subsequent SACKs for robustness against ACK loss. This SACK
is a duplicate ACK for sequence number 23801 and suggests that the receiver now
requires two full-size segments starting with sequence numbers 23801 and 26601.
The sender reacts immediately by initiating fast retransmit, but because of congestion
control procedures (see Chapter 16), the sender sends only one retransmission,
for segment 23801. With the arrival of two additional ACKs, the sender is
permitted to send its second retransmission, for segment 26601.

[Wireshark screenshot with packet 44 expanded: an ACK from 169.229.62.97 port 6666 to
70.231.128.151 port 1043 with sequence number 1, acknowledgment number 23801, header
length 44 bytes, window size 233016 (scaled), and 24 bytes of options containing
Timestamps (TSval 488819700, TSecr 297078328) and one SACK block with left edge 25201
and right edge 26601 (relative).]


Figure 14-11 The first ACK containing SACK information indicates an out-of-order block with 
sequence number range 25201 to 26601. 
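
The inference the sender draws from these SACKs, namely which ranges the receiver is
missing, can be reproduced with a short Python function. This is a sketch with a
hypothetical helper name (holes); it reports inclusive ranges to match the notation used
in the text, and it relies on the right edge of a SACK block being the last held sequence
number plus 1.

```python
def holes(ack, sack_blocks):
    """Return the inclusive ranges missing at the receiver below its SACKed data."""
    missing = []
    edge = ack                        # next sequence number the receiver expects
    for left, right in sorted(sack_blocks):
        if left > edge:
            missing.append((edge, left - 1))
        edge = max(edge, right)       # skip past the data this block reports as held
    return missing


# The first SACK: ACK 23801 with one block.
print(holes(23801, [(25201, 26601)]))
# The SACK arriving at time 0.967: ACK 23801 with two blocks.
print(holes(23801, [(28001, 29401), (25201, 26601)]))
```

For the first SACK this yields the single hole [23801,25200]; for the second it yields two
holes, starting at 23801 and 26601, which are the two segments the sender retransmits.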


A TCP SACK sender uses the recovery point idea introduced with NewReno.
In this example, the highest sequence number sent prior to the retransmission is
43400, which is lower than in the NewReno example from Figure 14-5. For this
implementation of SACK fast retransmit, three duplicate ACKs are not required;
the TCP initiates its retransmission earlier. The recovery exit is essentially the
same, though. Once the ACK for sequence number 43401 is received at time 1.3958,
recovery is complete.

It is interesting to note that the potential for better control of the sender using
SACKs does not always lead to increased overall throughput performance. This
fact is suggested by looking at the two examples we have seen. The NewReno
(non-SACK) sender completes the data transfer of 131,074 bytes in 3.592s. The
SACK sender completes it in 3.674s. These two measurements are not directly
comparable, however, because they did not face precisely the same network conditions
(this was not a simulation but rather a live test), although the conditions were
largely similar. The benefits of SACKs are more pronounced when the RTT is large
and packet loss is severe. Under such circumstances, the benefits of being able to
fill more than one hole per RTT are likely to be more significant.

14.7 Spurious Timeouts and Retransmissions

Under a number of circumstances, TCP may initiate a retransmission even when
no data has been lost. Such undesirable retransmissions are called spurious
retransmissions and are caused by spurious timeouts (timeouts firing too early) and other
reasons such as packet reordering, packet duplication, or lost ACKs. Spurious
timeouts can occur when the real RTT has recently increased significantly, beyond
the RTO. This happens more frequently in environments where lower-layer
protocols have widely varying performance (e.g., wireless) and was a concern
mentioned in [KP87]. Here we focus primarily on spurious retransmissions caused by
spurious timeouts. The effects of reordering and duplication on TCP are deferred
until the following section.

A number of approaches have been suggested to deal with spurious
timeouts. They generally involve a detection algorithm and a response algorithm. The
detection algorithm attempts to determine whether a timeout or timer-based 
retransmission was spurious. The response algorithm is invoked once a timeout 
or retransmission is deemed spurious. Its purpose is to undo or mitigate some 
action that is otherwise normally performed by TCP when a retransmission timer 
expires. In this chapter we discuss only the segment retransmission behavior. The 
response algorithms typically involve congestion control changes as well, and 
those aspects are discussed in Chapter 16. 

Figure 14-12 illustrates a highly simplified exchange that shows what happens
to a basic TCP when a spurious retransmission occurs because of a delay spike in
the ACK path after segment 8 is sent. After the retransmission of segment 5 occurs
because of a timeout, there are still ACKs in flight from the original transmissions
of segments 5 through 8. In this illustration, sequence and ACK numbers are
based on packets instead of bytes, with ACKs indicating what has already arrived
instead of what is expected next, for simplicity. When they arrive, TCP begins to
retransmit additional segments that have already been received, starting with the
segment following the ACKed segment. This causes TCP to behave in an undesirable
"go-back-N" behavior pattern and in turn causes a collection of duplicate
ACKs to be generated and returned to the sender, possibly triggering fast retransmit
as well. Several techniques have been developed to mitigate these problems.
We now have a look at some of the more popular ones.

14.7.1 Duplicate SACK (DSACK) Extension 

With a non-SACK TCP, an ACK can indicate only the highest in-sequence segment 
back to the sender. With SACK, it can signal other (out-of-order) segments as well. 
The basic SACK mechanism we discussed previously does not say what happens 
when a receiver receives duplicate data segments. Such segments can be the result 
of spurious retransmissions, duplication within the network, or other reasons.

Figure 14-12 A delay spike occurs after the transmission of packet 8, causing a spurious
retransmission timeout and retransmission of packet 5. After retransmission, an ACK for
the first copy of 5 arrives. The retransmission for 5 creates a duplicate packet at the
receiver, followed by an undesirable "go-back-N" behavior whereby packets 6, 7, and 8 are
retransmitted even though they are already present at the receiver.


DSACK or D-SACK, which stands for duplicate SACK [RFC2883], is a rule,
applied at the SACK receiver and interoperable with conventional SACK senders,
that causes the first SACK block to indicate the sequence numbers of a duplicate
segment that has arrived at the receiver. The main purpose of DSACK is to determine
when a retransmission was not necessary and to learn additional facts about
the network. With it, a sender has at least the possibility of inferring whether
packet reordering, loss of ACKs, packet replication, and/or spurious retransmissions
are taking place.

The implementation of DSACK is compatible with conventional SACK in the 
sense that no separate negotiation is required to make use of it. For it to work 
properly, a change is made to the content of SACKs sent from the receiver and 
a corresponding change to the logic at the sender. If a non-DSACK TCP shares 
a connection with a DSACK TCP, they will interoperate, but without any of the 
benefits of DSACK. 

The change to the SACK receiver is to allow a SACK block to be included even
if it covers sequence numbers below (or equal to) the cumulative ACK Number field.
This was not the original intent of SACK, but its capability is well matched to this
purpose. (It applies equally well in cases where the DSACK information is above
the cumulative ACK Number field; this happens for duplicated out-of-order
segments.) DSACK information is included in only a single ACK, and such an ACK
is called a DSACK. DSACK information is not repeated across multiple SACKs as
conventional SACK information is. As a consequence, DSACKs are less robust to
ACK loss than regular SACKs.
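
A sender can recognize a DSACK by examining the first SACK block of an arriving ACK. The
Python sketch below shows one way to do so. The function name is_dsack is our own, and the
second test (the first block lying inside the second block, used for duplicated
out-of-order segments) follows our reading of [RFC2883] rather than anything spelled out
in this section. The sequence numbers in the usage lines are hypothetical.

```python
def is_dsack(ack, sack_blocks):
    """Return True if the first SACK block reports duplicate data."""
    if not sack_blocks:
        return False
    left, right = sack_blocks[0]
    if right <= ack:
        return True                  # block lies at or below the cumulative ACK
    if len(sack_blocks) > 1:
        left2, right2 = sack_blocks[1]
        # A duplicated out-of-order segment: the first block is contained in the second.
        return left2 <= left and right <= right2
    return False


print(is_dsack(26601, [(23801, 25201)]))                  # duplicate of ACKed data
print(is_dsack(23801, [(25201, 26601)]))                  # ordinary SACK
print(is_dsack(23801, [(28001, 29401), (28001, 30801)]))  # duplicate above the ACK
```

What the sender does once it has made this determination is a separate question, taken up
next.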

Exactly what a sender given DSACK information is supposed to do with it
is not specified by [RFC2883]. An experimental algorithm is given in [RFC3708]
for detecting spurious retransmissions using DSACK, but it does not provide any
response algorithm. One option it mentions is to use the Eifel Response Algorithm,
which we investigate in Section 14.7.4 after introducing a few other detection
algorithms.

14.7.2 The Eifel Detection Algorithm 

At the beginning of this chapter, we discussed the retransmission ambiguity
problem. The experimental Eifel Detection Algorithm [RFC3522] deals with this problem
using the TCP TSOPT to detect spurious retransmissions. After a retransmission 
timeout occurs, Eifel awaits the next acceptable ACK. If the next acceptable ACK 
indicates that the first copy of a retransmitted packet (called the original transmit) 
was the cause for the ACK, the retransmission is considered to be spurious. 

The Eifel Detection Algorithm is able to detect spurious behavior earlier than 
the approach using only DSACK because it relies on ACKs generated as a result 
of packets arriving before loss recovery is initiated. DSACKs, conversely, are able 
to be sent only after a duplicate segment has arrived at the receiver and able to be 
acted upon only after the DSACK is returned to the sender. Detecting spurious 
retransmissions early can offer advantages, because it allows the sender to avoid 
most of the go-back-N behavior mentioned earlier. 

The mechanics of the Eifel Detection Algorithm are simple. It requires the use
of the TCP TSOPT. When a retransmission is sent (either a timer-based retransmission
or a fast retransmit), the TSV value is stored. When the first acceptable ACK
covering its sequence number is received, the incoming ACK's TSER is examined.
If it is smaller than the stored value, the ACK corresponds to the original transmission
of the packet and not the retransmission, implying that the retransmission
must have been spurious. This approach is fairly robust to ACK loss as well. If an
ACK is lost, any subsequent ACKs still have TSER values less than the stored TSV
of the retransmitted segment. Thus, a retransmission can be deemed spurious as a
result of any of the window's worth of ACKs arriving, so a loss of any single ACK
is not likely to cause a problem.
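
The timestamp comparison at the heart of the algorithm is small enough to sketch in
Python. The EifelDetector class below is our own illustration, not the procedure of
[RFC3522] in full: it ignores timestamp wraparound, tracks only the first retransmission,
and uses made-up sequence numbers and timestamp values.

```python
class EifelDetector:
    def __init__(self):
        self.rtx_seq = None     # sequence number of the retransmitted segment
        self.rtx_tsval = None   # TSV carried by that retransmission

    def on_retransmit(self, seq, tsval):
        """Remember the TSV of the first retransmission of an episode."""
        if self.rtx_tsval is None:
            self.rtx_seq, self.rtx_tsval = seq, tsval

    def on_ack(self, ack, tsecr):
        """Return True (spurious), False (not spurious), or None (no verdict yet)."""
        if self.rtx_tsval is None or ack <= self.rtx_seq:
            return None         # the ACK does not cover the retransmitted segment
        # An echoed timestamp older than the retransmission's TSV means the ACK was
        # generated by the original transmission.
        spurious = tsecr < self.rtx_tsval
        self.rtx_seq = self.rtx_tsval = None
        return spurious


d = EifelDetector()
d.on_retransmit(seq=5001, tsval=1000)
print(d.on_ack(ack=6401, tsecr=990))   # echo predates the retransmission
```

Because every ACK triggered by the original window of data echoes an older timestamp, the
verdict is the same whichever of those ACKs happens to arrive first.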

The Eifel Detection Algorithm can be combined with DSACKs. This can be
beneficial in the situation where an entire window's worth of ACKs are lost but
both the original transmit and retransmission have arrived at the receiver. In this
particular case, the arriving retransmit causes a DSACK to be generated. The Eifel
Detection Algorithm would by default conclude that the retransmission is spurious.
It is thought, however, that if so many ACKs are being lost, allowing TCP
to believe the retransmission was not spurious is useful (e.g., to induce it to start
sending more slowly, a consequence of the congestion control procedures we
discuss in Chapter 16). Thus, arriving DSACKs cause the Eifel Detection Algorithm to
conclude that the corresponding retransmission is not spurious.

14.7.3 Forward-RTO Recovery (F-RTO) 

Forward-RTO Recovery (F-RTO) [RFC5682] is a standard algorithm for detecting
spurious retransmissions. It does not require any TCP options, so when it is
implemented in a sender, it can be used effectively even with an older receiver that does
not support the TCP TSOPT. It attempts to detect only spurious retransmissions
caused by expiration of the retransmission timer; it does not deal with the other
causes for spurious retransmissions or duplications mentioned before.

F-RTO makes a modification to the action TCP ordinarily takes after a
timer-based retransmission. These retransmissions are for the smallest sequence number
for which no ACK has yet been received. Ordinarily, TCP continues sending
additional adjacent packets in order as additional ACKs arrive. This is the go-back-N
behavior described previously.

F-RTO modifies the ordinary behavior of TCP by having TCP send new (so far
unsent) data after the timeout-based retransmission when the first ACK arrives.
It then inspects the second arriving ACK. If either of the first two ACKs arriving
after the retransmission was sent are duplicate ACKs, the retransmission is
deemed OK. If they are both acceptable ACKs that advance the sender's window,
the retransmission is deemed to have been spurious. This approach is fairly
intuitive. If the transmission of new data results in the arrival of acceptable ACKs, the
arrival of the new data is moving the receiver's window forward. If such data is
only causing duplicate ACKs, there must be one or more holes at the receiver. In
either case, the reception of new data at the receiver does not harm the overall data
transfer performance (provided there are sufficient buffers at the receiver).
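
The two-ACK test can be written as a tiny state machine. The Python sketch below follows
only the description given in this section; the FRTO class, its return strings, and the
sequence numbers are our own, and [RFC5682] contains additional cases (and a SACK-enhanced
variant) that are not modeled here.

```python
class FRTO:
    def __init__(self, snd_una):
        self.snd_una = snd_una   # highest cumulative ACK before the timeout
        self.acks_seen = 0       # ACKs received since the timer-based retransmission
        self.verdict = None

    def on_ack(self, ack):
        """Return 'send_new_data', 'spurious', or 'genuine'."""
        if self.verdict is not None:
            return self.verdict
        advanced = ack > self.snd_una
        self.snd_una = max(self.snd_una, ack)
        self.acks_seen += 1
        if not advanced:
            self.verdict = "genuine"      # a duplicate ACK: the retransmission was needed
        elif self.acks_seen == 2:
            self.verdict = "spurious"     # both ACKs advanced the window
        else:
            return "send_new_data"        # first ACK advanced: send unsent data, keep watching
        return self.verdict


f = FRTO(snd_una=5001)
print(f.on_ack(6401))   # first ACK advances the window
print(f.on_ack(7801))   # so does the second
```

If the second call had instead repeated ACK 6401, the sketch would report the timeout as
genuine and conventional recovery would proceed.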

14.7.4 The Eifel Response Algorithm 

The Eifel Response Algorithm [RFC4015] is a standard set of operations to be
executed by a TCP once a retransmission has been deemed spurious. Because the
response algorithm is logically decoupled from the Eifel Detection Algorithm, it
can be used with any of the detection algorithms we just discussed. The Eifel
Response Algorithm was originally intended to operate for both timer-based and
fast retransmit spurious retransmissions but is currently specified only for
timer-based retransmissions.

Although the Eifel Response Algorithm can be used with any of the detection
algorithms, it behaves somewhat differently based on whether a spurious
timeout was detected early (e.g., by the Eifel or F-RTO detection algorithms) or
later (e.g., by DSACKs). The former cases are called spurious timeouts and operate
by inspecting ACKs for original transmissions. The latter are called late spurious
timeouts and are based on ACKs for retransmissions invoked as a result of (spurious)
timeouts.

The response algorithm operates on the first retransmission timer event only;
it is not executed if a subsequent timeout occurs before recovery is complete. After
the retransmission timer expires, it takes a snapshot of the values in srtt and rttvar 
and records them in new variables srtt_prev and rttvar_prev as follows: 

srtt_prev = srtt + 2(G) 
rttvar_prev = rttvar 

These variables are assigned on any timer expiration but are used only when the 
timeout is determined to be spurious. If so, they help form the basis for setting 
the new RTO. In the formula, the value G represents the TCP clock granularity. 
srtt_prev is set to srtt plus twice the timer granularity based on the following chain 
of reasoning: The spurious timeout may have been invoked because the value of 
srtt is just a tad too small. If it were just a bit larger, no timeout would have
happened. Adding the term 2(G) to srtt deals with this situation by storing a slightly
increased value into srtt_prev, which is used later for setting the RTO. 

After the srtt_prev and rttvar_prev values are stored, one of the detection
algorithms is invoked. The result of running the algorithm produces a value assigned
to a special variable called SpuriousRecovery. If the algorithm detects a spurious 
timeout, SpuriousRecovery is set to SPUR_TO. If it detects a late spurious timeout, it 
sets SpuriousRecovery to LATE_SPUR_TO. Otherwise, the timeout is not spurious, 
and ordinary TCP timeout processing continues. 

If SpuriousRecovery is SPUR_TO, TCP can take action before recovery is
complete. It does this by adjusting the sequence number of the next segment it is about
to send (called SND.NXT) to the first new, unsent segment (called SND.MAX). 
This avoids the undesirable go-back-N behavior after the initial retransmission 
discussed previously. If the detection algorithm detects a late spurious timeout, 
an ACK for the initial retransmission has already taken place, so SND.NXT is not 
changed. In either case, however, the congestion control state is reset (see Chapter 
16). In addition, once an acceptable ACK is received for a segment transmitted 
after the retransmission timer expires, the values of srtt, rttvar, and RTO can be 
updated as follows: 


srtt ← max(srtt_prev, m)
rttvar ← max(rttvar_prev, m/2)
RTO = srtt + max(G, 4(rttvar))


Here, m is a sample of the RTT of the connection based on the arrival of the
first acceptable ACK for data sent after the timeout. The motivation for these
modifications is that the real RTT may have changed so significantly that the RTT
history in the current estimators is no longer a valid basis for setting the RTO. If
the real path RTT has increased abruptly (e.g., because of wireless handoff to a
new base station), the current srtt and rttvar values are likely to be too small and
should be reinitialized. On the other hand, an increase in path RTT could be only
temporary, implying that reinitializing srtt and rttvar might not be such a good
idea because they are likely to be approximately correct.

These equations try to balance between the two situations by reassigning the
moving averages srtt and rttvar only if the new RTT samples are larger. Doing so
effectively throws out the previous history of the RTT (and RTT variance). The
values of srtt and rttvar can only increase as a result of the response algorithm. If the
RTT does not appear to be increasing, the running estimators remain unchanged,
essentially ignoring the fact that a timeout has occurred. The RTO is reassigned
in the conventional way in any case, and a new retransmission timer is set for this
timeout value.
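
The snapshot and update rules of the response algorithm can be sketched in a few
lines of Python. This is only an illustration under stated assumptions, not code
from any TCP implementation: the class name RttState, its method names, and the
1ms clock granularity are invented for the example, and all times are in seconds.

```python
from dataclasses import dataclass

G = 0.001  # assumed clock granularity (1 ms); real stacks differ


@dataclass
class RttState:
    srtt: float
    rttvar: float
    rto: float
    srtt_prev: float = 0.0
    rttvar_prev: float = 0.0

    def on_first_timeout(self) -> None:
        # Snapshot the estimators; srtt_prev gets the small 2(G) boost.
        self.srtt_prev = self.srtt + 2 * G
        self.rttvar_prev = self.rttvar

    def on_spurious_recovery(self, m: float) -> None:
        # m is the RTT sample from the first acceptable ACK for data
        # sent after the timeout. The estimators never drop below the
        # snapshot values here.
        self.srtt = max(self.srtt_prev, m)
        self.rttvar = max(self.rttvar_prev, m / 2)
        self.rto = self.srtt + max(G, 4 * self.rttvar)
```

With srtt = 100ms and rttvar = 10ms before the timeout, a post-timeout sample of
m = 300ms replaces both estimators (srtt = 300ms, rttvar = 150ms, RTO = 900ms),
whereas a sample of m = 50ms leaves srtt at the snapshot value of 102ms.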


14.8 Packet Reordering and Duplication 

Most of the issues discussed so far relate to how TCP handles packet loss. This 
is a relatively common issue, and a great deal of work has gone into making TCP 
robust to packet drops. As we began to see in the last section, other packet delivery 
anomalies such as duplication and reordering can also affect TCP's operation. In 
both of these cases, we wish TCP to be able to distinguish between packets that 
are reordered or duplicated and those that are lost. As we shall now see, this is 
sometimes not so simple. 

14.8.1 Reordering 

Packet reordering can occur in an IP network because IP provides no guarantee
that relative ordering between packets is maintained during delivery. This can be
beneficial (to IP at least), because IP can choose another path for traffic (e.g., one
that is faster) without having to worry about the consequences. Doing so, however,
may cause traffic freshly injected into the network to pass ahead of older traffic, so
that the order of packet arrivals at the receiver does not match the order of transmission
at the sender. There are other reasons packet reordering may occur. For example,
some high-performance routers employ multiple parallel data paths within the
hardware [BPS99], and different processing delays among packets can lead to a
departure order that does not match the arrival order.

Reordering may take place in the forward path or the reverse path of a TCP
connection (or in some cases both). The reordering of data segments has a somewhat
different effect on TCP than does reordering of ACK packets. Recall that
because of asymmetric routing, it is frequently the case that ACKs travel along
different network links (and through different routers) from data packets on the
forward path.

When traffic is reordered, TCP can be affected in several ways. If reordering
takes place in the reverse (ACK) direction, it causes the sending TCP to receive
some ACKs that move the window significantly forward followed by some evidently
old redundant ACKs that are discarded. This can lead to an unwanted
burstiness (instantaneous high-speed sending) behavior in the sending pattern
of TCP and also trouble in taking advantage of available network bandwidth,
because of the behavior of TCP's congestion control (see Chapter 16).

If reordering occurs in the forward direction, TCP may have trouble distinguishing
this condition from loss. Both loss and reordering result in the receiver
receiving out-of-order packets that create holes between the next expected packet
and the other packets received so far. When reordering is moderate (e.g., two adjacent
packets switch order), the situation can be handled fairly quickly. When reorderings
are more severe, TCP can be tricked into believing that data has been
lost even though it has not. This can result in spurious retransmissions, primarily
from the fast retransmit algorithm.

Recall from previous discussions that the fast retransmit algorithm relies
on observing duplicate acknowledgments from a TCP receiver in order to infer
the loss of a packet and to initiate a retransmission without having to wait for a
retransmission timer to expire. Because a TCP receiver is supposed to immediately
ACK any out-of-sequence data it receives in order to help induce fast retransmit
to be triggered on packet loss, any packet that is reordered within the network
causes a receiver to produce a duplicate ACK. If fast retransmit were to be invoked
whenever any duplicate ACK is received at the sender, a large number of unnecessary
retransmissions would occur on network paths where a small amount of
reordering is common. To handle this situation, fast retransmit is triggered only
after the duplicate threshold (dupthresh) has been reached.

The effect is illustrated in Figure 14-13. The left portion of the figure indicates
how TCP behaves with light reordering, where dupthresh is set to 3. In this case, the
single duplicate ACK does not affect TCP. It is effectively ignored and TCP overcomes
the reordering. The right-hand side indicates what happens when a packet
has been more severely reordered. Because it is three positions out of sequence,
three duplicate ACKs are generated. This invokes the fast retransmit procedure in
the sending TCP, producing a duplicate segment at the receiver.
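
The interaction between reordering and dupthresh can be reproduced with a small
simulation. The following Python sketch is an illustration only: it numbers whole
packets rather than bytes, assumes the receiver ACKs every arrival immediately,
and uses the hypothetical function names acks_for_arrivals and fast_retransmits.

```python
def acks_for_arrivals(arrivals, first_expected=1):
    """Return the cumulative ACK (next expected packet) sent per arrival."""
    received = set()
    expected = first_expected
    acks = []
    for pkt in arrivals:
        received.add(pkt)
        # Advance past every packet that is now contiguous.
        while expected in received:
            expected += 1
        acks.append(expected)
    return acks


def fast_retransmits(acks, dupthresh=3):
    """Return the ACK numbers for which fast retransmit would fire."""
    triggered = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == dupthresh:
                triggered.append(ack)
        else:
            last_ack, dup_count = ack, 0
    return triggered


light = acks_for_arrivals([1, 2, 3, 5, 4, 6])         # packets 4 and 5 swapped
severe = acks_for_arrivals([1, 2, 3, 5, 6, 7, 4, 8])  # packet 4 three places late
```

In the light case the sender sees a single duplicate ACK and does nothing. In the
severe case it sees three duplicate ACKs for packet 4 and would retransmit it,
even though nothing was lost.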

The problem of distinguishing loss from reordering is not trivial. Dealing 
with it involves trying to decide when a sender has waited long enough to try to 
fill apparent holes at the receiver. Fortunately, severe reordering on the Internet is 
not common [J03], so setting dupthresh to a relatively small number (such as the 
default of 3) handles most circumstances. That said, there are a number of research 
projects that modify TCP to handle more severe reordering [LLY07]. Some of these 
adjust dupthresh dynamically, as does the Linux TCP implementation.

Figure 14-13 Mild reordering (left) is overcome by ignoring a small number of duplicate ACKs.
When reordering is more severe (right), as in this case where packet 4 is three places
out of sequence, a spurious fast retransmit can be triggered.


14.8.2 Duplication 

Although rare, the IP protocol may deliver a single packet more than one time.
This can happen, for example, when a link-layer network protocol performs a
retransmission and creates two copies of the same packet. When duplicates are
created, TCP can become confused in some of the ways we have seen already.
Consider the case shown in Figure 14-14 in which packet number 3 has been duplicated
three times.



Figure 14-14 Packet duplication in the network has caused a spurious fast retransmission due to the
presence of duplicate ACKs.

As we can see, the effect of packet 3 being duplicated is to produce a series
of duplicate ACKs from the receiver. This is enough to trigger a spurious fast
retransmit, as the non-SACK sender may mistakenly believe that packets 5 and
6 have arrived earlier. With SACK (and DSACK, in particular) this is more easily
diagnosed at the sender. With DSACK, each of the duplicate ACKs for A3 contains
DSACK information that segment 3 has already been received. Furthermore,
none of them contains an indication of any out-of-order data, meaning the arriving
packets (or their ACKs) must have been duplicates. TCP can often suppress
spurious retransmissions in such cases.
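
The diagnosis described above can be expressed as a simple test on each duplicate
ACK. The Python sketch below is a simplified illustration, not a complete DSACK
implementation: the function names are hypothetical, SACK blocks are modeled as
(left, right) pairs with an exclusive right edge, and sequence number wraparound
is ignored.

```python
def is_dsack(ack, sack_blocks):
    """True if the first SACK block reports data received more than once."""
    if not sack_blocks:
        return False
    left, right = sack_blocks[0]
    if right <= ack:
        # The block lies at or below the cumulative ACK: a duplicate.
        return True
    if len(sack_blocks) > 1:
        left2, right2 = sack_blocks[1]
        # The block is contained in the following block: also a duplicate.
        return left2 <= left and right <= right2
    return False


def duplicate_only(ack, sack_blocks):
    """True if the ACK reports a duplicate and no out-of-order data."""
    has_out_of_order = any(right > ack for _, right in sack_blocks)
    return is_dsack(ack, sack_blocks) and not has_out_of_order
```

A sender could decline to count an ACK toward dupthresh when duplicate_only
returns true, because such an ACK reveals no hole at the receiver.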


14.9 Destination Metrics 

As we have seen, TCP "learns" the characteristics of the network path between 
the sender and the receiver over time. The learning is kept in state variables at 
the sender such as srtt and rttvar. Some TCP implementations also keep track of 
an estimate of the amount of packet reordering that has occurred recently along 
a path. Historically, this learning is lost once the connection is closed. That is, if
a new TCP connection is opened to the same receiver, it must start to determine 
values for the state variables from scratch. 

Newer TCP implementations maintain many of the metrics that we have
described in this chapter in a routing or forwarding table entry or other system-wide
data structure that exists even after TCP connections are closed. When a
new connection is created, TCP consults the data structure to see if there is any
preexisting information regarding the path to the destination host with which it
will be communicating. If so, srtt, rttvar, and so on can be initialized
to values based on previous, relatively recent experience. When a TCP
connection closes down, it has the opportunity to update the statistics. This can be
accomplished by replacing the existing statistics or updating them in some other
way. In the case of Linux 2.6, the values are updated to be the maximum of the
existing values and those measured by the most recent TCP connection. These values
can be inspected using the ip program available from the iproute2 suite of tools [IPR2]:

Linux% ip route show cache 132.239.50.184
132.239.50.184 from 10.0.0.9 tos 0x10 via 10.0.0.1 dev eth0
    cache mtu 1500 rtt 29ms rttvar 29ms cwnd 2 advmss 1460 hoplimit 64


This command shows information cached about previous connections with a
particular DSCP value (16, indicating CS2 but represented using the older "ToS"
byte terminology with value 0x10) between the local system and 132.239.50.184
using the IPv4 next hop 10.0.0.1 and accessed using the network device eth0.
We can see packet size information (the path MTU learned with PMTUD, the MSS
advertised by the remote side), the maximum number of hops to use (for IPv6; not
applicable here), values of srtt and rttvar, along with congestion control information
such as cwnd that we discuss in Chapter 16.
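
The idea of carrying metrics across connections can be sketched as a small cache
keyed by destination. This Python sketch is an assumption-laden illustration, not
the Linux data structure: the names PathMetrics and MetricsCache are invented, only
srtt and rttvar are stored, and the update rule is the maximum-of-old-and-new
policy described above for Linux 2.6.

```python
from dataclasses import dataclass


@dataclass
class PathMetrics:
    srtt_ms: float
    rttvar_ms: float


class MetricsCache:
    def __init__(self):
        self._by_dest = {}

    def lookup(self, dest):
        """Return cached metrics for dest, or None on first contact."""
        return self._by_dest.get(dest)

    def update_on_close(self, dest, measured):
        """Merge a closing connection's measurements into the cache."""
        old = self._by_dest.get(dest)
        if old is None:
            self._by_dest[dest] = measured
        else:
            # Keep the larger of the cached and the newly measured values.
            self._by_dest[dest] = PathMetrics(
                max(old.srtt_ms, measured.srtt_ms),
                max(old.rttvar_ms, measured.rttvar_ms),
            )
```

A new connection would call lookup before sending its first segment and, if an
entry exists, start from those values instead of from the protocol defaults.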


14.10 Repacketization 

When TCP times out and retransmits, it does not have to retransmit the identical
segment. Instead, TCP is allowed to perform repacketization, sending a bigger
segment, which can increase performance. (Naturally, this bigger segment cannot
exceed the MSS announced by the receiver and should not exceed the path MTU.)
This is allowed in the protocol because TCP identifies the data being sent and
acknowledged by its byte number, not its segment (or packet) number.

TCP's ability to retransmit a segment with a different size from the original
segment provides another way of addressing the retransmission ambiguity problem.
This has been the basis of an idea called STODER [TZZ05] that uses repacketization
to detect spurious timeouts.

We can easily see repacketization in action. We use our sock program as a
server and connect to it with Telnet. First we type the line hello there. This
produces a segment of 13 data bytes, including the carriage-return and newline
characters produced when the Enter key is pressed. We then disconnect the network
and type line number 2 (15 bytes, including the carriage-return and newline
characters). We then wait about 45s, type and 3, and terminate the connection:

Linux% telnet 169.229.62.97 6666
hello there                 (first line gets sent OK)
                            (then we disconnect the Ethernet cable)
line number 2               (this line gets retransmitted)
and 3                       (reconnect Ethernet)
^]
telnet> quit


We can see the results using tcpdump:


1 19:51:47.674418 IP 10.0.0.7.1029 > 169.229.62.97.6666:
    P 1:14(13) ack 1 win 5840
    <nop,nop,timestamp 2343578137 596377728>
    (data: "hello there\r\n")

2 19:51:47.788992 IP 169.229.62.97.6666 > 10.0.0.7.1029:
    . ack 14 win 58254 <nop,nop,timestamp 596378252 2343578137>

3 19:52:35.130837 IP 10.0.0.7.1029 > 169.229.62.97.6666:
    FP 29:36(7) ack 1 win 5840
    <nop,nop,timestamp 2343602439 596378252>
    (data: "and 3\r\n")

4 19:52:35.146358 IP 169.229.62.97.6666 > 10.0.0.7.1029:
    . ack 14 win 58254
    <nop,nop,timestamp 596382987 2343578137,nop,nop,
    sack sack 1 {29:36}>

5 19:52:39.414253 IP 10.0.0.7.1029 > 169.229.62.97.6666:
    FP 14:36(22) ack 1 win 5840
    <nop,nop,timestamp 2343604633 596382987>

6 19:52:39.429228 IP 169.229.62.97.6666 > 10.0.0.7.1029:
    . ack 37 win 58248 <nop,nop,timestamp 596383416 2343604633>

7 19:52:39.429696 IP 169.229.62.97.6666 > 10.0.0.7.1029:
    F 1:1(0) ack 37 win 58254
    <nop,nop,timestamp 596383416 2343604633>

8 19:52:39.430119 IP 10.0.0.7.1029 > 169.229.62.97.6666:
    . ack 2 win 5840 <nop,nop,timestamp 2343604641 596383416>

In this trace, the initial SYN exchange has been removed. The first two segments
contain the data string hello there and its acknowledgment. The next
packet in the trace is not in sequence: it starts with sequence number 29 and contains
the string and 3 (7 bytes). Its returning ACK contains ACK number 14 but
a SACK block with relative sequence numbers {29:36}. The middle sequence of
characters has been lost. TCP retransmits this but uses a larger packet, containing
sequence numbers 14:36. Thus, we can see how the retransmission for sequence
number 14 resulted in a repacketization to form a larger packet of size 22 bytes.
Interestingly, this packet overlaps the data present in the SACK block and also carries
the FIN bit field, indicating that it is the last data of the connection.


Segment 5 data: "line number 2\r\nand 3\r\n"
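
Because TCP numbers bytes rather than segments, a retransmission can simply take
as many unacknowledged bytes as fit in one segment. The Python sketch below
illustrates this with the data from the trace. The function name repacketize and
its parameters are hypothetical, and window limits and the FIN flag are ignored.

```python
def repacketize(send_buffer, snd_una, buffer_start, mss):
    """Build one retransmission starting at the first unacknowledged byte.

    send_buffer holds the unacknowledged bytes; its first byte has
    sequence number buffer_start. Returns (sequence number, payload).
    """
    offset = snd_una - buffer_start
    payload = send_buffer[offset:offset + mss]
    return snd_una, payload


# The two lines typed while the network was disconnected.
unacked = b"line number 2\r\n" + b"and 3\r\n"
seq, payload = repacketize(unacked, snd_una=14, buffer_start=14, mss=1460)
```

Here the two original writes (15 and 7 bytes) leave as a single 22-byte segment
covering relative sequence numbers 14:36, matching segment 5 in the trace.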


14.11 Attacks Involving TCP Retransmission 

There is a class of DoS attack called low-rate DoS attacks [KK03]. In such an attack,
an attacker sends bursts of traffic to a gateway or host, causing the victim system
to experience a retransmission timeout. Given an ability to predict when the
victim TCP will attempt to retransmit, the attacker generates a burst of traffic at
each retransmission attempt. As a consequence, the victim TCP perceives congestion
in the network, throttles its sending rate to near zero, keeps backing off its
RTO according to Karn's algorithm, and effectively receives very little network
throughput. The proposed mechanism to deal with this type of attack is to add
randomization to the RTO, making it difficult for the attacker to guess the precise
times when a retransmission will take place.
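
The effect of randomizing the RTO can be illustrated with a short Python sketch.
It is not taken from [KK03] or from any TCP implementation: the function name,
the uniform jitter, and the 60s cap are assumptions made for the example.

```python
import random


def backoff_schedule(rto, attempts, jitter=0.0, rng=None, max_rto=60.0):
    """Return the timeout used for each successive retransmission.

    The base RTO doubles after every attempt (binary exponential
    backoff). With jitter > 0, each timeout is stretched by a random
    fraction in [0, jitter] of the base value.
    """
    rng = rng or random.Random()
    schedule = []
    for _ in range(attempts):
        fuzz = rng.uniform(0.0, jitter) * rto
        schedule.append(min(rto + fuzz, max_rto))
        rto = min(rto * 2, max_rto)
    return schedule


predictable = backoff_schedule(1.0, 4)
randomized = backoff_schedule(1.0, 4, jitter=0.5, rng=random.Random(1))
```

Without jitter the retransmission times are exactly predictable from the initial
RTO. With jitter, an attacker must cover a wider interval around each attempt to
be sure of hitting it.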

A related but distinct form of DoS attack involves slowing a victim TCP's segments
down so that the RTT estimate is too high. Doing so causes the victim TCP
to be less aggressive in retransmitting its own packets when they are lost. The
opposite attack is also possible: an attacker forges ACKs when data has been transmitted
but has not actually arrived at the receiver yet. In this case, the attacker can
cause the victim TCP to believe that the connection RTT is significantly smaller
than it really is, leading to an overaggressive TCP that creates numerous unwanted
retransmissions.


14.12 Summary

This chapter provided a detailed look at TCP's timeout and retransmission strategy.
Our first example illustrated a case in which we simply unplugged the network
when a TCP had a packet to send. This resulted in a retransmission timer
initiating a timeout-based retransmission. Each successive retransmit took place
at an interval twice as long as the previous transmission, the result of the second
part of Karn's algorithm that incorporates binary exponential backoff.

TCP measures the RTT and then uses these measurements to keep track of a
smoothed RTT estimator and a smoothed mean deviation estimator. These two
estimators are then used to calculate the next retransmission timeout value. Without
the Timestamps option, a TCP measures only a single RTT per window of data.
Karn's algorithm removes the retransmission ambiguity problem by preventing
the use of RTT measurements for segments that have been lost. Today, most TCPs
use the Timestamps option, which permits each segment to be individually timed.
The Timestamps option operates correctly even in the face of packet reordering or
packet duplication.

We looked at the fast retransmit algorithm, which can be triggered without
requiring a timer to expire. This is the most efficient method (and the most frequently
used one) for TCP to fill holes at the receiver caused by missing packets.
Fast retransmit can be improved with the use of selective ACKs. These carry additional
information in the ACKs and permit the SACK-capable TCP sender to repair
more than one hole per RTT. Doing so can lead to improved performance under
some circumstances.

If the RTT estimate is below the actual RTT of the connection, a spurious
retransmission may take place. In such cases, if TCP waited a little longer, the
(unnecessary) retransmission would not happen. A number of algorithms have
been developed to detect when a TCP has experienced a spurious timeout. The
DSACK approach requires the arrival of a duplicate segment at the receiver. The
Eifel Detection Algorithm depends on TCP timestamps but can react faster than
DSACKs because it detects spurious timeouts based on ACKs returning from segments
that were sent prior to the timeout. F-RTO is another algorithm that behaves
similarly to Eifel but does not require timestamps. It also changes the sender to
send new data after a timeout that is deemed to be spurious. All of these detection
algorithms can be combined with a response algorithm. The main one described
so far is the Eifel Response Algorithm, which can reset RTT and RTT variance
estimates if the delay has increased substantially (and otherwise "undoes" any
changes TCP would otherwise perform on a timeout).

We also looked at how TCP state can be cached across connections, how TCP
is allowed to repacketize its data, and some attacks that can be mounted to fool
TCP into behaving in undesired ways such as being too passive or aggressive. We
shall see more about the consequences of these attacks in Chapter 16, where we
investigate TCP's congestion control procedures.


15.1 Introduction

Chapter 13 dealt with the establishment and termination of TCP connections, and
Chapter 14 examined how TCP ensures reliable delivery using retransmissions
of data that has been lost. We now examine the dynamics of TCP data transfers,
focusing initially on interactive connections and then introducing flow control
and associated window management procedures that are used in conjunction
with congestion control (see Chapter 16) for bulk data transfers.

An "interactive" TCP connection is one in which user input such as keystrokes,
short messages, or joystick/mouse movements needs to be delivered between a client
and a server. If small segments are used to carry such user input, the protocol
imposes more overhead because there are fewer useful payload bytes per packet
exchanged. On the other hand, filling packets with more data usually requires
them to be delayed, which can have a negative impact on delay-sensitive applications
such as online games and collaboration tools. We shall investigate techniques
with which the application can trade off between these two issues.

After discussing interactive communications, we discuss the methods used
by TCP for achieving flow control by dynamically adapting the window size to
ensure that a sender does not overrun a receiver. This issue primarily impacts
bulk data transfer (i.e., noninteractive communications) but can also affect interactive
applications. In Chapter 16 we will explore how the concept of flow control
can be extended to protect not only the receiver, but also the network between the
sender and the receiver.


15.2 Interactive Communication

The amount of network traffic carried in a particular portion of the Internet over
a certain amount of time is usually measured in terms of bytes or packets. There
is considerable variation in these numbers. For example, local area traffic differs
from wide area traffic, and traffic between different sites tends to vary. Studies
of TCP traffic [P05][F03] usually find that 90% or more of all TCP segments contain
bulk data (e.g., Web, file sharing, electronic mail, backups) and the remaining
portion contains interactive data (e.g., remote login, network games). Bulk data
segments tend to be relatively large (1500 bytes or larger), while interactive data
segments tend to be much smaller (tens of bytes of user data).

TCP handles both types of data using the same protocol and packet format, 
but different algorithms come into play for each. In this section, we will look at 
how interactive data is transferred by TCP, using the ssh (secure shell) application 
as one example. Secure shell [RFC4251] is a remote login protocol that provides 
strong security (privacy and authentication based on cryptography). It has mostly 
replaced the earlier UNIX rlogin and Telnet programs that provide remote login 
service but without strong security. 

As we investigate ssh, we will see how delayed acknowledgments work and
how the Nagle algorithm reduces the number of small packets across wide area
networks. The same algorithms apply to other applications supporting remote login
capability such as Telnet, rlogin, and Windows Terminal Services.

Let us look at the flow of data when we type an interactive command on an
ssh connection. The client captures what the user types and ships it over to the
server to be interpreted, and the server ships any responses back to the client. The
client encrypts the data it sends, meaning that the characters typed by the user are
encoded before being transferred over the connection (see Chapter 18). The encoding
makes determining the typed keys difficult for an eavesdropper. The client
supports several encryption algorithms and different authentication methods. It
also supports several other advanced features such as tunneling other protocols
(see Chapter 3 and [RFC4254]).

Many newcomers to TCP/IP are surprised to find that each interactive keystroke
normally generates a separate data packet. That is, the keystrokes are sent
from the client to the server individually (one character at a time rather than one
line at a time). Furthermore, ssh invokes a shell (command interpreter) on the
remote system (the server), which echoes the characters that are typed at the client.
A single typed character could thus generate four TCP segments: the interactive
keystroke from the client, an acknowledgment of the keystroke from the
server, the echo of the keystroke from the server, and an acknowledgment of the
echo from the client back to the server (see Figure 15-1(a)).

Normally, however, segments 2 and 3 are combined: in Figure 15-1(b), the
acknowledgment of the keystroke is sent along with the echo of the characters
typed. We describe the technique that combines these (called delayed acknowledgments
with piggybacking) in the next section.

Figure 15-1 One possible way to remotely echo an interactive keystroke is a separate ACK and echo
packet (a). A typical TCP coalesces the ACK for the data byte and the echo of the byte
into a single packet (b).


We purposely use ssh for this example because it generates a packet for each 
character typed from the client to the server. If the user types especially fast, 
however, more than one character might be carried in a single packet. Figure 15-2 
shows the flow of data using Wireshark when we type the date command across 
an active ssh connection to a Linux server.

Figure 15-2 TCP segments sent when the date command is typed on an already-established ssh connection.


In Figure 15-2, packet 1 carries the character d from the client to the server.
Packet 2 is the acknowledgment of this character and its echo (combining the middle
two segments as in Figure 15-1). Packet 3 is the acknowledgment of the echoed
character. Packets 4-6 correspond to the character a, packets 7-9 to the character t,
and packets 10-12 to the character e. Packets 13-15 correspond to the Enter (carriage
return) key. The delays between packets 3-4, 6-7, 9-10, and 12-13 are the human
delays between typing each character, which were intentionally made unusually 
long (about 1.5s) in this case for illustration. 

Notice that packets 16-19 are slightly different because they have grown in
size from 48 bytes to 64 bytes. Packet 16 contains the output of the date command
from the server. The 64 bytes are the encrypted version of the following 28 cleartext
(not-yet-encrypted) characters:

Wed Dec 28 22:47:16 PST 2005

plus the carriage-return and line-feed characters at the end. The next packet sent
from the server to the client (packet 18) contains the client's prompt on the server
host: Linux%. Packet 19 acknowledges this data.

Figure 15-3 is the same trace as in Figure 15-2, except now more of the TCP-layer
information is shown, indicating how TCP acknowledgments operate and the
packet sizes used by ssh. Packet 1 (containing the d character) starts with the relative
sequence number 0. Packet 2 ACKs the packet from line 1 by setting the ACK number
to 48, the sequence number of the last successfully received byte plus 1. Packet 2 also
sends the data byte with a sequence number of 0 from the server to the client, containing
the echo of the d character. The echoed d is ACKed by the client in packet 3 by setting
the ACK number to 48. We see that the connection has two streams of sequence
numbers in use: one from the client to the server, and one in the reverse direction.
We shall explore this in more detail when we discuss window advertisements.

Figure 15-3 The same trace as in Figure 15-2, except the protocol decode for ssh has been disabled, revealing
the TCP sequence number information. Note that all data packets are 48 bytes in size except the
last two. The size of 48 bytes relates to the cryptography used in ssh (see Chapter 18).


One other observation we can make about this trace is that each packet with
data in it (not zero length) also has the PSH bit field set. As mentioned earlier,
this flag is conventionally used to indicate that the buffer at the side sending the
packet has been emptied in conjunction with sending the packet. In other words,
when the packet with the PSH bit field set left the sender, the sender had no more
data to send.


15.3 Delayed Acknowledgments 

In many cases, TCP does not provide an ACK for every incoming packet. This is
possible because of TCP's cumulative ACK field (see Chapter 12). Using a cumulative
ACK allows TCP to intentionally delay sending an ACK for some amount of
time, in the hope that it can combine the ACK it needs to send with some data the
local application wishes to send in the other direction. This is a form of piggybacking
that is used most often in conjunction with bulk data transfers. Obviously a
TCP cannot delay ACKs indefinitely; otherwise its peer could conclude that data
has been lost and initiate an unnecessary retransmission.


Note 

The Host Requirements RFC [RFC1122] states that TCP should implement a 
delayed ACK but the delay must be less than 500ms. Many implementations use 
a maximum of 200ms. 
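
The receiver-side decisions involved in delaying an ACK can be sketched as
follows. This Python sketch is illustrative only and does not reproduce any
particular operating system's logic: the class and method names are invented,
the 200ms delay is the common maximum mentioned in the note, and ACKing at least
every second segment is an assumption reflecting common practice.

```python
class DelayedAckReceiver:
    def __init__(self, max_delay=0.200, ack_every=2):
        self.max_delay = max_delay   # longest time an ACK may be held
        self.ack_every = ack_every   # ACK at once after this many segments
        self.pending = 0             # segments received but not yet ACKed
        self.deadline = None         # time at which a held ACK must go out

    def on_segment(self, now):
        """Return True if an ACK should be sent immediately."""
        self.pending += 1
        if self.pending >= self.ack_every:
            self._clear()
            return True
        if self.deadline is None:
            self.deadline = now + self.max_delay
        return False

    def on_send_data(self):
        """Outgoing data carries the ACK; True if one was piggybacked."""
        piggybacked = self.pending > 0
        self._clear()
        return piggybacked

    def on_timer(self, now):
        """Return True if the delayed ACK timer has expired."""
        if self.deadline is not None and now >= self.deadline:
            self._clear()
            return True
        return False

    def _clear(self):
        self.pending = 0
        self.deadline = None
```

A held ACK therefore leaves in one of three ways: piggybacked on outgoing data,
forced out by the arrival of a second segment, or sent alone when the timer
expires.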


Delaying ACKs causes less traffic to be carried over the network than when
ACKs are not delayed because fewer ACKs are used. A ratio of 2 to 1 is fairly common
for bulk transfers. The use of delayed ACKs and the maximum amount of
time TCP is allowed to wait before sending an ACK can be configured, depending
on the host operating system. Linux uses a dynamic adjustment algorithm
whereby it can change between ACKing every segment (called "quickack" mode)
and conventional delayed ACK mode. On Mac OS X, the system variable
net.inet.tcp.delayed_ack determines how delayed ACKs are to be used. The values
work as follows: disable delay (0), always delay (1), ACK every other packet
(2), and autodetect when to respond (3). The default is 3. On recent versions of
Windows, the registry entries under


HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\IG


(where IG refers to the GUID of the particular network interface being referenced)
work a bit differently. The value for TcpAckFrequency
(which needs to be added) can range from 0 to 255 and defaults to 2. It determines
the number of ACKs outstanding before the delayed ACK timer is ignored. Setting
the value to 1 effectively causes ACKs to be generated for every segment received.
The ACK timer, when used, can be controlled with the TcpDelAckTicks registry
entry. This value can be set in the range from 2 to 6 and defaults to 2. It is the number
of hundreds of milliseconds to wait before sending a delayed ACK.

For the reasons mentioned earlier, TCP is generally set up to delay ACKs
under certain circumstances, but not to delay them too long. We will see extensive
use of delayed ACKs in Chapter 16, when we look at how TCP's congestion control
behaves during bulk transfers with large packets. When smaller packets are used,
such as for interactive applications, another algorithm comes into play. The combination
of this algorithm with delayed ACKs can lead to poor performance if not
handled carefully, so we will now look at it in more detail.


15.4 Nagle Algorithm 

We saw in the previous section that as little as one keystroke at a time often flows
from the client to the server across an ssh connection. When using IPv4, sending
one single key press generates TCP/IPv4 packets of about 88 bytes in size (using
the encryption and authentication from the example): 20 bytes for the IP header,
20 bytes for the TCP header (assuming no options), and 48 bytes of data. These
small packets (called tinygrams) have a relatively high overhead for the network.
That is, they contain relatively little useful application data compared to the rest
of the packet contents. Such high-overhead packets are normally not a problem on
LANs, because most LANs are not congested and such packets would not need to
be carried very far. However, these tinygrams can add to congestion and lead to
inefficient use of capacity on wide area networks. A simple and elegant solution
was proposed by John Nagle in [RFC0896], now called the Nagle algorithm. First we
will describe how it operates, and then we will discuss some pitfalls and problems
that can occur as a result of using it with delayed ACKs.
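
The overhead of such a tinygram can be worked out directly from the sizes given
above. The following Python sketch only restates that arithmetic; the constant
names are chosen for the example, and the single keystroke is treated as 1 byte
of useful data.

```python
IP_HEADER = 20     # bytes, IPv4 header without options
TCP_HEADER = 20    # bytes, TCP header without options
SSH_PAYLOAD = 48   # bytes of encrypted ssh data per keystroke

packet_size = IP_HEADER + TCP_HEADER + SSH_PAYLOAD

# Share of the packet taken up by the IP and TCP headers.
header_share = (IP_HEADER + TCP_HEADER) / packet_size

# Share of the packet that is the single typed character.
keystroke_share = 1 / packet_size

print(f"{packet_size} bytes, {header_share:.1%} headers, "
      f"{keystroke_share:.1%} keystroke")
```

Headers alone account for roughly 45% of each 88-byte packet, and the typed
character itself for little more than 1%.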

The Nagle algorithm says that when a TCP connection has outstanding data
that has not yet been acknowledged, small segments (those smaller than the SMSS)
cannot be sent until all outstanding data is acknowledged. Instead, small amounts
of data are collected by TCP and sent in a single segment when an acknowledgment
arrives. This procedure effectively forces TCP into stop-and-wait behavior: it
stops sending until an ACK is received for any outstanding data. The beauty of
this algorithm is that it is self-clocking: the faster the ACKs come back, the faster the
data is sent. On a comparatively high-delay WAN, where reducing the number of
tinygrams is desirable, fewer segments are sent per unit time. Said another way,
the RTT controls the packet sending rate.
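
The rule just described can be written as a small decision function. The following is a minimal sketch in Python, not code from any real TCP implementation: the function and parameter names are hypothetical, and the sketch deliberately ignores the receiver's advertised window and congestion control.

```python
def nagle_can_send(pending_bytes: int, smss: int, unacked_bytes: int,
                   nagle_enabled: bool = True) -> bool:
    """Decide whether queued data may be sent now under the Nagle rule.

    pending_bytes: application data queued but not yet sent (hypothetical name)
    smss:          sender maximum segment size
    unacked_bytes: data already sent but not yet acknowledged
    """
    if pending_bytes <= 0:
        return False                 # nothing to send
    if not nagle_enabled:
        return True                  # no restriction on small segments
    if pending_bytes >= smss:
        return True                  # a full-size segment is never "small"
    # A small segment may go out only when no data is outstanding.
    return unacked_bytes == 0


# One keystroke with nothing outstanding is sent at once; a second keystroke
# typed before the ACK returns is held back and coalesced.
first_key = nagle_can_send(pending_bytes=1, smss=1460, unacked_bytes=0)
second_key = nagle_can_send(pending_bytes=1, smss=1460, unacked_bytes=1)
```

Because the held data is released only when an ACK arrives, the sending rate of small segments in this sketch is tied to the RTT, which is the self-clocking property noted above.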

We saw in Figure 15-3 that the RTT for a single byte to be sent, acknowledged,
and echoed can be small (under 15ms). To generate data faster than this we would
have to type more than 60 characters per second. This means that we rarely
encounter any observable effects of this algorithm when sending data between
two hosts with a small RTT, such as when they are on the same LAN.

To illustrate the effect of the Nagle algorithm, we can compare the behaviors
of an application using TCP with the Nagle algorithm enabled and disabled. We
modify a version of the ssh client for this purpose. Using a connection with a
relatively large RTT of about 190ms, we can see the differences. First, we examine the
case when Nagle is disabled (the default for ssh), as shown in Figure 15-4.

Figure 15-4 An ssh trace showing a TCP connection with approximately a 190ms RTT. The Nagle
algorithm is disabled. Transmissions and ACKs are intermingled, and the exchange
takes 0.58s using 19 packets. Many packets are relatively small (48 bytes of user data).
Pure ACKs (segments with no data) indicate that command output at the server has
been processed by the client.


The trace in Figure 15-4 begins after the initial authentication protocol has 
completed and the login session has begun. The date command is then typed. 
We see that 19 packets are captured, and the entire exchange lasts 0.58s. There are 
five ssh request packets, seven ssh response packets, and seven TCP-level pure
ACKs (no data). If we repeat this measurement soon after (i.e., in similar network
conditions), but instead leave the Nagle algorithm enabled, we see the behavior
shown in Figure 15-5.

We can see immediately that the number of packets in Figure 15-5 is smaller
than in Figure 15-4 (by eight). The other striking difference is the regularity of
how the requests and responses are ordered and separated by time. Recall that the
Nagle algorithm forces TCP to operate in a stop-and-wait fashion, so that the TCP
sender cannot proceed until ACKs are received. If we look at the times for each
request/response pair (0.0, 0.19, 0.38, and 0.57), we see that they follow a pattern:
each is separated by almost exactly 190ms, which is very close to the RTT of the
connection. The consequence of having to wait one RTT for each request/response
adds to the overall time to complete the exchange (0.80s instead of the 0.58s when
Nagle was disabled). This is the trade-off the Nagle algorithm makes: fewer and
larger packets are used, but the required delay is higher. The different behaviors
can be seen even more clearly in Figure 15-6.

The effect of the Nagle algorithm's stop-and-wait behavior can be seen clearly
in Figure 15-6. The exchange on the left side keeps both directions of the
connection busy, while with the Nagle algorithm enabled only one direction of the
connection is busy at any given time.

Figure 15-5 An ssh trace showing a TCP connection with a 190ms RTT and the Nagle algorithm in
operation. Requests are followed in lockstep with responses, and the exchange takes
0.80s using 11 packets.


15.4.1 Delayed ACK and Nagle Algorithm Interaction

If we consider what happens when the delayed ACK and Nagle algorithms are 
used together, we can construct an undesirable scenario. Consider a client using 
delayed ACKs that sends a request to a server, and the server responds with an 
amount of data that does not quite fit inside a single packet (see Figure 15-7). 


[Figure 15-7 diagram. Labels: Client (Delaying ACKs); Server (Using Nagle); Request; ACK;
Delayed ACK Timer; Response Packet (Full-size Packet); Response Packet (Small Packet); ACK;
Server Cannot Send Additional Responses Until ACK Is Received; Response Packet]


Figure 15-7 The interaction between the Nagle algorithm and delayed ACKs. A temporary form of 
deadlock can occur until the delayed ACK timer fires. 


Here we see that the client, after receiving two packets from the server,
withholds an ACK, hoping that additional data headed toward the server can be
piggybacked. Generally, TCP is required to provide an ACK for two received packets
only if they are full-size, and they are not here. At the server side, because the
Nagle algorithm is operating, no additional packets are permitted to be sent to the
client until an ACK is returned because at most one "small" packet is allowed to
be outstanding. The combination of delayed ACKs and the Nagle algorithm leads
to a form of deadlock (each side waiting for the other) [MMSV99][MM01].
Fortunately, this deadlock is not permanent and is broken when the delayed ACK timer
fires, which forces the client to provide an ACK even if the client has no additional
data to send. However, the entire data transfer becomes idle during this deadlock
period, which is usually not desirable. The Nagle algorithm can be disabled in
such circumstances, as we saw with ssh.

15.4.2 Disabling the Nagle Algorithm 

As we might conclude from the previous example, there are times when the Nagle
algorithm needs to be turned off. Typical examples include cases where as little
delay as possible is required, for example, when a mouse movement or keystroke
must be delivered without delay to provide real-time feedback for a user whose
display is handled remotely. Another example is in multiplayer online games,
where character movements must be delivered as quickly as possible so as to not
interfere with proper causality in the game (and to not delay it too much for other
players).

The Nagle algorithm can be disabled in a number of ways. The ability to
disable it is required by the Host Requirements RFC [RFC1122]. An application can
specify the TCP_NODELAY option when using the Berkeley sockets API. In
addition, it is possible to disable the Nagle algorithm on a system-wide basis. In
Windows, this can be accomplished using the following registry key:


HKLM\SOFTWARE\Microsoft\MSMQ\Parameters\TCPNoDelay


This DWORD value, which must be added by the user, should be set to the value 1
in order to disable the Nagle algorithm. Message Queuing may have to be installed
for this change to be effective [MMQ].
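
To illustrate the per-application approach, the following sketch sets the TCP_NODELAY option through Python's standard socket module, which wraps the Berkeley sockets API. The helper name make_nodelay_socket is our own; socket.IPPROTO_TCP and socket.TCP_NODELAY are standard library constants. The socket is created but never connected, so the example only shows how the option is set and read back.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A nonzero option value disables Nagle for this socket only.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
# Read the option back; a nonzero result means Nagle is disabled.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```

Setting the option per socket keeps the change local to the application that needs low latency, whereas a system-wide setting affects every connection on the host.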


15.5 Flow Control and Window Management 

Recall from Chapter 12 that a variable sliding window can be used to implement
flow control. In Figure 15-8, a TCP client and server are interacting, providing each
other with information about the data flow, including segment sequence numbers,
ACK numbers, and window sizes (i.e., available space at the receiver).



Figure 15-8 Each TCP connection is bidirectional. Data going in one direction causes the peer to respond
with ACKs and window advertisements. The same is true for the reverse direction.

The two large arrows in Figure 15-8 indicate the direction of data flow (the
direction in which TCP segments are sent). Recalling that every TCP connection
has data flowing in both directions, we have two arrows, one in the client-to-server
direction (C→S) and another in the server-to-client direction (S→C). Every
segment contains ACK and window information and may also contain some user
data. The fields used in the TCP header are shaded based on the direction of data
flow they describe. For example, data flowing in the C→S direction is included
in segments flowing along the bottom arrow, but the ACK number and window
advertisement for this data are returned in segments following the top arrow.
Every TCP segment (except those exchanged during connection establishment)
includes a valid Sequence Number field, an ACK Number or Acknowledgment field,
and a Window Size field (containing the window advertisement).

In each of the ssh examples in this chapter so far, we have seen an unchanging
window advertisement conveyed from one TCP peer to the other. Examples
include 8320 bytes, 4220 bytes, and 32,900 bytes. These sizes represent the amount
of space the sender of the segment has reserved for storing incoming data the
peer sends. When TCP-based applications are not busy doing other things, they
are typically able to consume any and all data TCP has received and queued for
them, leading to no change of the Window Size field as the connection progresses.
On slow systems, or when the application has other things to accomplish, data
may have arrived for the application, been acknowledged by TCP, and be sitting
in a queue waiting for the application to read or "consume" it. When TCP starts to
queue data in this way, the amount of space available to hold new incoming data
decreases, and this change is reflected by a decreasing value of the Window Size
field. Eventually, if the application does not read or otherwise consume the data
at all, TCP must take some action to cause the sender to cease sending new data
entirely, because there would be no place to put it on arrival. This is accomplished
by sending a window advertisement of zero (no space).

The Window Size field in each TCP header indicates the amount of empty
space, in bytes, remaining in the receive buffer. The field is 16 bits in TCP, but with
the Window Scale option, values larger than 65,535 can be used (see Chapter 13).
The largest sequence number the sender of a segment is willing to accept in the
reverse direction is equal to the sum of the Acknowledgment Number and Window
Size fields in the TCP header (scaled appropriately).
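
The sum described above can be sketched in a few lines of Python. The function name and parameters are hypothetical; window_scale stands for the shift count negotiated with the Window Scale option (0 when the option is not in use), and the result is reduced modulo 2^32 because TCP sequence numbers are 32 bits wide.

```python
def right_window_edge(ack_number: int, window_field: int,
                      window_scale: int = 0) -> int:
    """Sum of the ACK number and the (scaled) Window Size field."""
    scaled_window = window_field << window_scale  # apply Window Scale shift
    return (ack_number + scaled_window) % 2**32   # sequence space wraps


# Without scaling, the 16-bit field is used as is.
edge_unscaled = right_window_edge(3002, 1535)
# With a shift count of 2, a field value of 65,535 offers 262,140 bytes.
edge_scaled = right_window_edge(1000, 65535, window_scale=2)
```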

15.5.1 Sliding Windows 

Each endpoint of a TCP connection is capable of sending and receiving data. The
amount of data sent or received on a connection is maintained by a set of window
structures. For each active connection, each TCP endpoint maintains a send window
structure and a receive window structure. These structures are similar to the
conceptual window structures described in Chapter 12, but here we describe them in
more detail. Figure 15-9 shows a hypothetical TCP send window structure.

Figure 15-9 The TCP sender-side sliding window structure keeps track of which sequence numbers
have already been acknowledged, which are in flight, and which are yet to be sent. The
size of the offered window is controlled by the Window Size field sent by the receiver
in each ACK.


TCP maintains its window structures in terms of bytes (not packets). In
Figure 15-9 we have numbered the bytes 2 through 11. The window advertised by the
receiver is called the offered window and covers bytes 4 through 9, meaning that the
receiver has acknowledged all bytes up through and including number 3 and has
advertised a window size of 6. Recall from Chapter 12 that the Window Size field
contains a byte offset relative to the ACK number. The sender computes its usable window,
which is how much data it can send immediately. The usable window is the offered
window minus the amount of data already sent but not yet acknowledged. The
variables SND.UNA and SND.WND are used to hold the values of the left window edge
and offered window. The variable SND.NXT holds the next sequence number to be
sent, so the usable window is equal to (SND.UNA + SND.WND - SND.NXT).
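
As a worked instance of this formula, the sketch below computes the usable window in Python. SND.UNA = 4 and SND.WND = 6 follow from the offered window covering bytes 4 through 9; the value SND.NXT = 7 is an assumption chosen for illustration (three bytes sent but not yet acknowledged). Clamping the result at zero is also our own choice, and sequence number wraparound is ignored.

```python
def usable_window(snd_una: int, snd_wnd: int, snd_nxt: int) -> int:
    """Bytes that may be sent immediately: SND.UNA + SND.WND - SND.NXT."""
    return max(0, snd_una + snd_wnd - snd_nxt)


# Offered window covers bytes 4..9; bytes 4, 5, and 6 are assumed in flight.
usable = usable_window(snd_una=4, snd_wnd=6, snd_nxt=7)  # bytes 7, 8, 9
```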

Over time this sliding window moves to the right, as the receiver
acknowledges data. The relative motion of the two ends of the window increases or
decreases the size of the window. Three terms are used to describe the movement
of the right and left edges of the window:

1. The window closes as the left edge advances to the right. This happens when
data that has been sent is acknowledged and the window size gets smaller.

2. The window opens when the right edge moves to the right, allowing more
data to be sent. This happens when the receiving process on the other end
reads acknowledged data, freeing up space in its TCP receive buffer.

3. The window shrinks when the right edge moves to the left. The Host
Requirements RFC [RFC1122] strongly discourages this, but TCP must be
able to cope with it. Section 15.5.3 on silly window syndrome shows an
example where one side would like to shrink the window by moving the
right edge to the left but cannot.

Because every TCP segment contains both an ACK number and a window
advertisement, a TCP sender adjusts the window structure based on both values
whenever an incoming segment arrives. The left edge of the window cannot move
to the left, because this edge is controlled by the ACK number received from the
other end that is cumulative and never goes backward. When the ACK number
advances the window but the window size does not change (a common case), the
window is said to advance or "slide" forward. If the ACK number advances but
the window advertisement grows smaller with other arriving ACKs, the left edge
of the window moves closer to the right edge. If the left edge reaches the right
edge, it is called a zero window. This stops the sender from transmitting any data.
If this happens, the sending TCP begins to probe the peer's window (see Section
15.5.2) to look for an increase in the offered window.
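
The three terms above can be made concrete by comparing the window edges before and after an incoming segment is processed. The following Python sketch is illustrative only: the function name and the string labels are our own, and sequence number wraparound is ignored.

```python
def describe_window_change(old_una: int, old_wnd: int,
                           new_una: int, new_wnd: int) -> set[str]:
    """Classify how the window edges moved between two ACKs."""
    old_right = old_una + old_wnd
    new_right = new_una + new_wnd
    events = set()
    if new_una > old_una:
        events.add("closes")    # left edge advanced to the right
    if new_right > old_right:
        events.add("opens")     # right edge moved to the right
    if new_right < old_right:
        events.add("shrinks")   # right edge moved to the left (discouraged)
    return events


# The ACK advances by 3 with an unchanged window size of 6: the window slides.
slide = describe_window_change(4, 6, 7, 6)
# The ACK advances to the old right edge with window 0: a zero window.
zero = describe_window_change(4, 6, 10, 0)
```

In this sketch a sliding window shows up as the window closing and opening by the same amount, while a zero window shows up as the window closing until the left edge meets the right edge.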

The receiver also keeps a window structure, which is somewhat simpler than
the sender's. The receiver window structure keeps track of what data has already
been received and ACKed, as well as the maximum sequence number it is willing
to receive. The TCP receiver depends on this structure to ensure the correctness
of the data it receives. In particular, it wishes to avoid storing duplicate bytes it
has already received and ACKed, and it also wishes to avoid storing bytes that it
should not have received (any bytes beyond the sender's right window edge). The
receiver's window structure is illustrated in Figure 15-10.

Figure 15-10 The TCP receiver-side sliding window structure helps the receiver know which
sequence numbers to expect next. Sequence numbers in the receive window are stored
when received. Those outside the window are discarded.


This structure also contains a left and right window edge like the sender's
window, but the in-window bytes (4-9 in this picture) need not be differentiated
as they are in the sender's window structure. For the receiver, any bytes received
with sequence numbers less than the left window edge (called RCV.NXT) are
discarded as duplicates, and any bytes received with sequence numbers beyond the
right window edge (RCV.WND bytes beyond RCV.NXT) are discarded as out of
scope. Bytes arriving with any sequence number in the receive window range are
accepted. Note that the ACK number generated at the receiver may be advanced
only when segments fill in directly at the left window edge because of TCP's
cumulative ACK structure. With selective ACKs, other in-window segments can
be acknowledged using the TCP SACK option, but ultimately the ACK number
itself is advanced only when data contiguous to the left window edge is received
(see Chapter 14 for more details on SACK).
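
The receiver's acceptance test can be sketched as a simple range check. In the Python below, the function name and return labels are hypothetical; RCV.NXT = 4 and RCV.WND = 6 correspond to the in-window bytes 4 through 9 mentioned above, and sequence number wraparound is ignored.

```python
def classify_byte(seq: int, rcv_nxt: int, rcv_wnd: int) -> str:
    """Classify one arriving byte against the receive window."""
    if seq < rcv_nxt:
        return "duplicate"       # left of the window: already received
    if seq >= rcv_nxt + rcv_wnd:
        return "out of scope"    # beyond the right window edge
    return "accepted"            # inside the receive window


# In-window bytes are 4 through 9.
results = [classify_byte(seq, rcv_nxt=4, rcv_wnd=6) for seq in (3, 4, 9, 10)]
```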

15.5.2 Zero Windows and the TCP Persist Timer 

We have seen that TCP implements flow control by having the receiver specify
the amount of data it is willing to accept from the sender: the receiver's
advertised window. When the receiver's advertised window goes to zero, the sender is
effectively stopped from transmitting data until the window becomes nonzero.
When the receiver once again has space available, it provides a window update to
the sender to indicate that data is permitted to flow once again. Because such
updates do not generally contain data (they are a form of "pure ACK"), they are
not reliably delivered by TCP. TCP must therefore handle the case where such
window updates that would open the window are lost.

If an acknowledgment (containing a window update) is lost, we could end up
with both sides waiting for the other: the receiver waiting to receive data (because
it provided the sender with a nonzero window and expects to see incoming data)
and the sender waiting to receive the window update allowing it to send. To
prevent this form of deadlock from occurring, the sender uses a persist timer to query
the receiver periodically, to find out if the window size has increased. The persist
timer triggers the transmission of window probes. Window probes are segments
that force the receiver to provide an ACK, which also necessarily contains a
Window Size field. The Host Requirements RFC [RFC1122] suggests that the first probe
should happen after one RTO and subsequent probes should occur at
exponentially spaced intervals (i.e., similar to the "second part" of Karn's algorithm, which
we discussed in Chapter 14).
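
A probe schedule of this kind can be sketched as follows. This Python example is an illustration under stated assumptions rather than the behavior of any particular TCP: the function name is hypothetical, each interval is simply doubled, and the 60s ceiling on the interval (max_interval) is our own assumption, since the text above does not specify an upper bound.

```python
def persist_probe_times(rto: float, count: int,
                        max_interval: float = 60.0) -> list[float]:
    """Times (seconds after the zero window) at which probes are sent."""
    times = []
    now = 0.0
    interval = rto                  # the first probe follows after one RTO
    for _ in range(count):
        now += interval
        times.append(now)
        # Exponential spacing, limited by the assumed ceiling.
        interval = min(interval * 2, max_interval)
    return times


# With an initial interval of 2s, probes are spaced 2s, 4s, and 8s apart.
schedule = persist_probe_times(rto=2.0, count=3)
```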

Window probes contain a single byte of data and are therefore reliably
delivered (retransmitted) by TCP if lost, thereby eliminating the potential deadlock
condition caused by lost window updates. The probes are sent whenever the TCP
persist timer expires, and the byte included may or may not be accepted by the
receiver, depending on how much buffer space it has available. As with the TCP
retransmission timer (see Chapter 14), the normal exponential backoff can be used
when calculating the timeout for the persist timer. An important difference,
however, is that a normal TCP never gives up sending window probes, whereas it may
eventually give up trying to perform retransmissions. This can lead to a certain
resource exhaustion vulnerability that we discuss in Section 15.7.

15.5.2.1 Example

To illustrate the use of the dynamic window size adjustment and flow control in
TCP, we create a TCP connection and cause the receiving process to pause before
consuming data from the network. For this experiment, we use a Mac OS X 10.6
sender and a Windows 7 receiver. The receiver runs our sock program with the
-P flag as follows:


C:\> sock -i -s -P 20 6666 


This arranges for the receiver to pause 20s prior to consuming data from the
network. The result is that eventually the receiver's advertised window begins to
close, as shown with packet 125 in Figure 15-11.

Figure 15-11 After a period when the advertised window does not change, acknowledgments
continue but the window size grows smaller as the receiver's buffer fills up. If the
receiving application fails to consume any data and the sender continues, the window
eventually reaches zero.

In this trace we see that for more than 100 packets the receiver's window
remains pegged at 64KB. This is because of an automatic window adjustment
algorithm (see Section 15.5.4) that allocates memory to the receiving TCP even if
not requested by the application. However, this eventually runs short, so we see
the window begin to reduce starting with packet 125. A large number of ACKs
follow, each reducing the window further while increasing the ACK number by 2896
bytes per ACK. This indicates that the receiving TCP is storing the data, but the
application is not consuming it. If we look further into the trace, we see that
eventually the receiver has no more space to hold the incoming data (see Figure 15-12).

Figure 15-12 The receiver's buffer has filled up. When the receiving application starts reading again, a
window update tells the sender that there is now an opportunity to transfer more data.


Here we can see that packet 151 fills the small 327-byte window, as indicated 
by the TCP Window Full comment provided by Wireshark. After about 200ms, 
at time 4.979, a zero window advertisement is produced, indicating that no more 
data can be received. This is no surprise, given that the sender has filled the last 
known available window and the receiving application will not consume any data 
until time 20.143. 

After receiving the zero window advertisement, the sending TCP tries to probe
the receiver three times at 5s intervals to see if the window has opened. At time
20, as instructed, the receiver begins to consume the data present in TCP's queue.
This causes two window updates to be sent to the sender, indicating that further
data transmission (up to 64KB) is now possible. Such segments are called window
updates because they do not acknowledge any new data; they just advance the
right edge of the window. At this point, the sender is able to resume normal data
transmission and complete the transfer.

There are numerous points that we can summarize using Figures 15-11 and
15-12:

1. The sender does not have to transmit a full window's worth of data.

2. A single segment from the receiver acknowledges data and slides the
window to the right at the same time. This is because the window
advertisement is relative to the ACK number in the same segment.

3. The size of the window can decrease, as shown by the series of ACKs in
Figure 15-11, but the right edge of the window does not move left, so as to
avoid window shrinkage.

4. The receiver does not have to wait for the window to fill before sending an
ACK.

In addition to these points, it is instructive to look at the throughput this
connection achieves as a function of time. Using Wireshark's Statistics | TCP Stream Graph
| Throughput Graph function, we observe the time series as shown in Figure 15-13.


[Figure 15-13 graph, titled "TCP Graph 2: pause.td 10.0.1.33:53005 -> 10.0.1.37:6666"]

Figure 15-13 With a relatively large receive buffer, a significant amount of data can be transferred
even before the receiving application reads any data from the network.

Here we see an interesting behavior. Even before the receiving application has
consumed any data, the connection still achieves a throughput of approximately
1.3MB/s. This continues until approximately time 0.10. After that, the throughput is
essentially zero until the receiver begins consuming data much later (after time 20).
15.5.3 Silly Window Syndrome (SWS) 

Window-based flow control schemes, especially those that do not use fixed-size
segments (such as TCP), can fall victim to a condition known as the silly window
syndrome (SWS). When it occurs, small data segments are exchanged across the
connection instead of full-size segments [RFC0813]. This leads to undesirable
inefficiency because each segment has relatively high overhead: a small number of
data bytes relative to the number of bytes in the headers.

SWS can be caused by either end of a TCP connection: the receiver can
advertise small windows (instead of waiting until a larger window can be advertised),
and the sender can transmit small data segments (instead of waiting for
additional data to send a larger segment). Correct avoidance of silly window syndrome
requires a TCP to implement rules specifically for this purpose, whether operating
as a sender or a receiver. TCP never knows ahead of time how a peer TCP will
behave. The following rules are applied:

1. When operating as a receiver, small windows are not advertised. The receive
algorithm specified by [RFC1122] is to not send a segment advertising a
larger window than is currently being advertised (which can be 0) until the
window can be increased by either one full-size segment (i.e., the receive
MSS) or by one-half of the receiver's buffer space, whichever is smaller.
Note that there are two cases where this rule can come into play: when
buffer space has become available because of an application consuming data
from the network, and when TCP must respond to a window probe.

2. When sending, small segments are not sent and the Nagle algorithm
governs when to send. Senders avoid SWS by not transmitting a segment
unless at least one of the following conditions is true:

a. A full-size (send MSS bytes) segment can be sent.

b. TCP can send at least one-half of the maximum-size window that the
other end has ever advertised on this connection.

c. TCP can send everything it has to send and either (i) an ACK is not
currently expected (i.e., we have no outstanding unacknowledged data) or
(ii) the Nagle algorithm is disabled for this connection.

Condition (a) is the most straightforward and directly avoids the
high-overhead segment problem. Condition (b) deals with hosts that always advertise tiny
windows, perhaps smaller than the segment size. Condition (c) prevents TCP from
sending small segments when there is unacknowledged data waiting to be ACKed
and the Nagle algorithm is enabled. If the sending application is doing small writes
(e.g., smaller than the segment size), condition (c) avoids silly window syndrome.
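
Conditions (a) through (c) for the sender can be collected into one predicate. The Python below is a minimal sketch with hypothetical names; it treats the amount that "can be sent" as the smaller of the queued data and the usable window, which is our reading of the conditions rather than a statement about any specific implementation.

```python
def sws_sender_may_send(pending: int, usable: int, send_mss: int,
                        max_adv_window: int, unacked: int,
                        nagle_enabled: bool = True) -> bool:
    """Sender-side SWS avoidance: may a segment be transmitted now?

    pending:        bytes queued by the application, not yet sent
    usable:         usable window in bytes
    max_adv_window: largest window the peer has ever advertised
    unacked:        bytes sent but not yet acknowledged
    """
    sendable = min(pending, usable)
    if sendable <= 0:
        return False
    if sendable >= send_mss:                 # condition (a)
        return True
    if 2 * sendable >= max_adv_window:       # condition (b)
        return True
    if sendable == pending and (unacked == 0 or not nagle_enabled):
        return True                          # condition (c)
    return False


# A 952-byte usable window with more data queued, a 1460-byte MSS, and a
# largest advertised window of 3000 bytes satisfies none of the conditions.
blocked = sws_sender_may_send(pending=4096, usable=952, send_mss=1460,
                              max_adv_window=3000, unacked=0)
```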

These three conditions also let us answer the following question: If the Nagle
algorithm prevents us from sending small segments while there is outstanding
unacknowledged data, how small is small? From condition (a) we see that "small"
means that the number of bytes is less than the SMSS (i.e., the largest packet size
that does not exceed the PMTU or the receiver's MSS). Condition (b) comes into
play only with older, primitive hosts or when a small advertised window is used
because of a limited receive buffer size.

Condition (b) of step 2 requires that the sender keep track of the maximum
window size advertised by the other end. This is an attempt by the sender to guess
the size of the other end's receive buffer. Although the size of the receive buffer
could decrease while the connection is established, in practice this is rare.
Furthermore, recall that TCP avoids window shrinkage.
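
The receiver-side rule (rule 1 in the list above) can be sketched in the same style. The Python below uses hypothetical names and follows the [RFC1122] threshold as stated in the text: the advertisement is raised only when the increase is at least one receive MSS or one-half of the buffer, whichever is smaller. Real implementations may use a different threshold.

```python
def receiver_window_to_advertise(current_adv: int, free_space: int,
                                 recv_mss: int, buffer_size: int) -> int:
    """Receiver-side SWS avoidance: choose the window to advertise.

    current_adv: window currently on offer to the peer (may be 0)
    free_space:  bytes now free in the receive buffer
    """
    threshold = min(recv_mss, buffer_size // 2)
    if free_space - current_adv >= threshold:
        return free_space       # the increase is large enough to announce
    return current_adv          # keep advertising the old (smaller) window


# With a 3000-byte buffer and a 1460-byte MSS, five 256-byte reads (1280
# bytes free) are not enough to leave a zero window, but six (1536) are.
after_five_reads = receiver_window_to_advertise(0, 1280, 1460, 3000)
after_six_reads = receiver_window_to_advertise(0, 1536, 1460, 3000)
```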

15.5.3.1 Example 

We will now present a detailed example to see silly window syndrome avoidance
in action; this example also involves the persist timer. We will use our sock
program with a Windows XP sending host and a FreeBSD receiver, doing three
2048-byte writes to the network. The command at the sender is as follows:


C:\> sock -i -n 3 -w 2048 10.0.0.8 6666


The corresponding command at the receiver is


FreeBSD% sock -i -s -P 15 -p 2 -r 256 -R 3000 6666


This fixes the receive buffer at 3000 bytes, causes an initial delay of 15s before
reading from the network, injects 2s of delay between each read, and sets each
read amount to be 256 bytes. The reason for the initial pause is to let the receiver's
buffer fill, ultimately forcing the transmitter to stop. By having the receiver then
perform small reads from the network, we expect to see it perform silly window
syndrome avoidance. Figure 15-14 is the trace as displayed by Wireshark.

The contents of the entire connection are displayed in the figure. Packet
lengths are described in terms of how many TCP payload bytes are included in
each segment. During connection establishment, the receiver advertises a window
of 3000 bytes with an MSS of 1460 bytes. The sender sends a 1460-byte packet
(packet 4) at time 0.052 and 588 bytes (packet 5) at time 0.053. The sum of these
sizes equals the 2048-byte write size used by the application. Packet 6
acknowledges both data packets from the sender and provides a window advertisement of
952 bytes (3000 - 1460 - 588 = 952).

The 952-byte window (packet 6) is not as large as a full MSS, so the Nagle
algorithm at the sender prevents filling it immediately. Instead, we see a delay
of 5s before any further action is taken. The sender waits for 5s, until the persist
timer expires, before sending a window probe. Given that the sender is sending a
packet anyhow, the sending TCP adds the permitted 952 bytes to fill the available
window. This fills the window, as confirmed by the zero window advertisement
contained in packet 8.

Figure 15-14 Trace of a TCP transfer illustrating silly window syndrome avoidance. The sender avoids filling
the offered window at time 0.053 because of sender-side SWS avoidance. Instead, it waits until
time 5.066, also acting effectively as a window probe. Receiver-side SWS avoidance can be seen
by looking at packet 14, which advertises a zero window even though the receiver has consumed
some data.

The next event in the trace is when TCP sends a window probe at time 6.970,
about 2s after receiving the first zero window advertisement. The probe itself
contains a single data byte and is labeled "TCP ZeroWindowProbe" by Wireshark,
but the ACK for this does not move the ACK number forward (Wireshark labels
this a "TCP ZeroWindowProbeAck"), so the byte has not been kept at the receiver.
Another 1-byte probe is produced at time 10.782 (about 4s later), and another at
time 18.408 (about 8s later), showing the characteristic exponential timeout
backoff. Note that for this latter window probe, the single byte is acknowledged by the
receiver.

At time 25.061, after the application has had a chance to perform six 256-byte
reads (spaced 2s apart), a window update indicates that 1535 bytes (plus 1 for the
ACK number) are now free in the receiver's buffer. This is "large enough"
according to receiver-side SWS avoidance. The sender begins to fill the window, starting
with a 1460-byte packet at time 25.064, resulting in an ACK at time 25.161 for byte
4462 with a window advertisement of only 75 bytes (packet 17). This
advertisement appears to violate our rule that the amount advertised should be at least an
MSS or (in the case of FreeBSD) one-quarter of the total buffer. The reason is to
avoid window shrinkage. With the last window update (packet 15), the receiver
advertises a right window edge of byte (3002 + 1535) = 4537. If the present ACK
(packet 17) were to advertise less than 75 bytes, as would be required by
receiver-side SWS avoidance, the right window edge would move left, a condition TCP is
not supposed to allow. Consequently, the 75-byte advertisement represents a form
of override: avoiding window shrinkage is preferred to avoiding SWS.
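
The override can be checked with the numbers from this paragraph. The following Python sketch uses hypothetical names; sws_window stands for whatever window receiver-side SWS avoidance alone would choose, and the function raises it if necessary so that the right window edge never moves left.

```python
def advertised_window(rcv_nxt: int, prev_right_edge: int,
                      sws_window: int) -> int:
    """Apply the no-shrink override to an SWS-avoidance window choice."""
    # Smallest window that keeps the previously advertised right edge.
    no_shrink_floor = max(0, prev_right_edge - rcv_nxt)
    return max(sws_window, no_shrink_floor)


prev_right_edge = 3002 + 1535   # right edge advertised by packet 15
# The ACK in packet 17 is for byte 4462; SWS avoidance alone would prefer 0.
window = advertised_window(rcv_nxt=4462, prev_right_edge=prev_right_edge,
                           sws_window=0)
```

Here the floor of 75 bytes wins over the zero window that SWS avoidance would otherwise produce, which is the ordering of priorities described above.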

We see the effect of sender-side SWS avoidance once again with the 5s delay
between packets 17 and 18. The sender is forced to send the 75-byte packet and
the receiver responds with another zero window advertisement. Packet 20, which
appears a second later, is another window probe, which results in a window of 767
bytes. Another round of sender-side SWS avoidance results in a 5s delay; the sender
fills the window, again resulting in a zero window; and the pattern repeats. The
pattern is eventually broken because the sender has no more data to send. Packet
30 represents the last data sent, and the connection is eventually closed some 20s
later (because of the 2s delays between each read at the receiving application).

To undersfand fhe relafionships among fhe applicafion behavior, fhe adver¬ 
fised window, and SWS avoidance, we can capfure fhe connecfion's dynamics in 
fabular form. Table 15-1 gives fhe acfion af fhe sender and fhe receiver, as well as 
an esfimafed fime when fhe receiving applicafion performs ifs reads. 


Table 15-1 Dynamics of the window advertisement and application to avoid silly window syndrome

Time               Packet  Action:         Action:            Action:          Buffer:  Buffer:
                   Number  TCP Sender      TCP Receiver       Application      Data     Available
41                                                            256 byte read    1721     1279
42.784             26      5306:6073(767)                                      2488     512
42.881             27                      ACK 6074 win 0                      2488     512
43                                                            256 byte read    2232     768
43.485             28      6073:6073(1)                                        2233     767
43.485             29                      ACK 6074 win 767                    2233     767
43.486             30      6074:6144(71)                                       2304     696
43.581             31                      ACK 6145 win 696                    2304     696
43.711             32      6145 (FIN)
43.711             33                      ACK 6146 win 695                    2305     695
45,47,49,51,53,55                                             6x256 byte read  769      2231
55.212             34                      ACK 6146 win 2232                   768      2232
57,59,61                                                      3x256 byte read  0        3000
63                                                            0 byte read      0        3000
63.252             35                      FIN                                 0        3000


In Table 15-1, the first column is the relative point in time for each action if it
appears in the trace. Those times with three digits to the right of the decimal point
are taken from the Wireshark output (refer to Figure 15-16). Those times with no
digits to the right of the decimal point are the inferred times of the action on the
receiving host, which are not represented in the trace.

The amount of data in the receiver's buffer (labeled "Data" in the table)
increases when data arrives from the sender and decreases as the application
reads (consumes) data from the buffer. What we want to follow are the window
advertisements sent by the receiver to the sender, and what those window
advertisements contain. This lets us see how the receiver avoids SWS.

As discussed previously, the first evidence of SWS avoidance is the 5s delay
between segments 6 and 7, where the sender avoids trying to send with a
952-byte window until it is forced to. When this happens, the receiver fills up,
causing a series of zero window advertisements and window probe exchanges. We
can see the exponential backoff on the persist timer in action: probes are sent at
times 6.970, 10.782, and 18.408. The gaps between successive events are
approximately 2, 4, and 8s, counting from when the sender first received the zero
window advertisement at time 5.160.
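
The backoff can be checked directly from the trace times. The short Python sketch below uses only the four timestamps quoted above; the variable names are ours.

```python
zero_window_time = 5.160               # zero window advertisement received
probe_times = [6.970, 10.782, 18.408]  # window probes seen in the trace

events = [zero_window_time] + probe_times
# Gap between each event and the one before it, in seconds.
gaps = [round(later - earlier, 3) for earlier, later in zip(events, events[1:])]
# Growth of each gap relative to the previous one.
ratios = [round(b / a, 2) for a, b in zip(gaps, gaps[1:])]

print(gaps)
print(ratios)
```

Each gap is roughly twice the previous one, which is the doubling expected from an exponentially backed-off persist timer.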

Although the application reads data at times 15 and 17, it has read only 512
bytes by time 18.408. The receiver-side SWS avoidance rules dictate that no window
update should be provided to the sender because the available 512 bytes of
buffer are neither half the size of the total buffer (3000 bytes) nor at least one MSS
(1460 bytes). Lacking a window update, the sender sends a window probe at time
18.408 (segment 13). This probe is received and the byte is kept by the receiver,
because some buffer space is available, as verified by the increasing ACK number
between segments 12 and 14.

Although 511 bytes are available in the receiver's buffer, receiver-side SWS
avoidance kicks in once again. The FreeBSD implementation of receiver SWS
avoidance differentiates between when to send a window update and how to
respond to a window probe. Although it follows the rules in [RFC1122] and sends
a window update only when at least half of the total receive buffer (or an MSS) can
be advertised, when responding to a window probe it advertises a larger window
when the window is either at least an MSS size or when at least one-fourth of the
total receive buffer size can be advertised. In either case, the 511 bytes are less than
a full MSS and also less than 3000/4 = 750 bytes, so this form of receiver-side SWS
avoidance dictates that the window advertisement included in the ACK for
segment 13 must contain the value 0.
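
The two thresholds can be written side by side as a minimal Python sketch. It models only the rules as stated in this section, with function names of our own choosing; it is not taken from the FreeBSD source.

```python
def window_update_allowed(free_bytes, mss, buf_size):
    """Rule for an unsolicited window update: half the buffer or one MSS."""
    return free_bytes >= min(mss, buf_size // 2)


def probe_response_window(free_bytes, mss, buf_size):
    """Window placed in the ACK answering a probe: one MSS or a quarter."""
    return free_bytes if free_bytes >= min(mss, buf_size // 4) else 0


MSS, BUF = 1460, 3000
print(window_update_allowed(512, MSS, BUF))   # time 18.408: no update yet
print(probe_response_window(511, MSS, BUF))   # ACK for segment 13
print(probe_response_window(767, MSS, BUF))   # a later probe in the trace
print(window_update_allowed(1535, MSS, BUF))  # time 25: update is sent
```

With a 3000-byte buffer the probe threshold is 750 bytes, so 511 bytes is answered with a zero window while 767 bytes is advertised in full.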

By the time the application completes its sixth read at time 25, the receive
buffer has 1535 bytes free (more than half of the total 3000-byte size), so a window
update is sent (segment 15). The sender continues with a full-size segment
(segment 16), for which it receives an ACK but a window advertisement of only 75
bytes. In the next 5s, both sender- and receiver-side SWS avoidance take place.
The sender waits for a larger window advertisement, and the application performs
reads at times 27 and 29, but the 587 bytes of free receive buffer space are not
enough to allow a window update to be sent. The sender therefore has to wait the
entire 5s and eventually sends its 75 bytes, forcing the receiver again into SWS
avoidance.

With the receiver not providing a window update, the sender's persist timer
causes a window probe to be sent at time 31.548. In this case, the FreeBSD receiver
responds with a nonzero window, of size 767 bytes (larger than one-fourth of the
total receive buffer). This window is not large enough for the sender's SWS
avoidance procedure, however, so the sender waits another 5s and the process repeats.
Finally, at time 43.486, the last 71 bytes are sent and acknowledged. The
acknowledgment contains a window advertisement of 696 bytes. Although it is less than
one-quarter of the receiver's total buffer size, the advertisement is not made zero
by receiver-side SWS avoidance in order to avoid window shrinkage.

The connection termination begins with segment 32, which contains no data.
It is acknowledged immediately with a window advertisement of 695 bytes (the
FIN consumed a sequence number at the receiver). After the application completes
another six reads, the receiver provides a window update, but the sender is done
sending and remains silent. The application performs another four reads, three of
which return 256 bytes and the final one of which returns nothing, indicating the
end of arriving data. At this point, the receiver closes the connection, causing the
FIN to be sent to the sender. The sender responds with the final ACK, completing
the bidirectional closing of the connection.

Because the sending application issues a close operation after performing its
three 2048-byte writes, the sender's end of the connection goes from the
ESTABLISHED state to the FIN_WAIT_1 state after sending segment 32 (see Chapter
13). It then goes to the FIN_WAIT_2 state after receiving segment 33. Although
it receives a window update while in this state, no action is taken, because it has
already sent a FIN that has been acknowledged (there is no timer in this state).
Instead, it merely sits in this state until receiving a FIN from the other end. This is
why we see no further transmissions by the sender until it receives the FIN
(segment 35).

15.5.4 Large Buffers and Auto-Tuning 

In this chapter, we have seen that an application using a small receive buffer size
may be doomed to significant throughput degradation compared to other
applications using TCP in similar conditions. Even if the receiver specifies a large enough
buffer, the sender might specify too small a buffer, ultimately leading to bad
performance. This problem became so important that many TCP stacks now decouple
the allocation of the receive buffer from the size specified by the application. In
most cases, the size specified by the application is effectively ignored, and the
operating system instead uses either a large fixed value or a dynamically
calculated value.

In newer versions of Windows (Vista/7) and Linux, receive window
auto-tuning [S98] is supported. With auto-tuning, the amount of data that can be
outstanding in the connection (its bandwidth-delay product, an important concept
we discuss in Chapter 16) is continuously estimated, and the advertised window
is arranged to always be at least this large (provided enough buffer space remains
to do so). This has the advantage of allowing TCP to achieve its maximum
available throughput rate (subject to the available network capacity) without having to
allocate excessively large buffers at the sender or receiver ahead of time. In
Windows, the receiver's buffer size is auto-sized by the operating system by default.
However, the behavior can be modified using the netsh command:


C:\> netsh interface tcp set heuristics disabled 
C:\> netsh interface tcp set global autotuninglevel=X 

where X is one of the following: disabled, highlyrestricted, restricted,
normal, or experimental. The setting affects the automatic selection of the
receiver's advertised window. In the disabled state, auto-tuning is not used, and the
window size uses a default value. The restricted modes slow the window growth,
and the normal setting allows it to grow relatively quickly. The experimental
mode allows the window to grow very aggressively but is not recommended for
normal use because many Internet sites and some firewalls interfere with or fail to
implement the TCP Window Scale option properly.

With Linux 2.4 and later, sender-side auto-tuning is supported. With version
2.6.7 and later, both receiver- and sender-side auto-tuning is supported. However,
auto-tuning is subject to limits placed on the buffer sizes. The following Linux
sysctl variables control the sender and receiver maximum buffer sizes. The
values after the equal sign are the default values (which may vary depending on the
particular Linux distribution), which should be increased if the system is to be
used in high bandwidth-delay-product environments:

net.core.rmem_max = 131071 
net.core.wmem_max = 131071 
net.core.rmem_default = 110592 
net.core.wmem_default = 110592 

In addition, the auto-tuning parameters are given by the following variables:

net.ipv4.tcp_rmem = 4096 87380 174760 
net.ipv4.tcp_wmem = 4096 16384 131072 

Each of these variables contains three values: the minimum, default, and
maximum buffer size used by auto-tuning.
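
To judge whether such limits suit a particular path, the three-value format can be parsed and the maximum compared with the path's bandwidth-delay product. The Python sketch below is an illustration under simplifying assumptions: the helper names are hypothetical, and the comparison ignores any part of the buffer the kernel reserves for its own bookkeeping.

```python
def parse_tcp_mem(value):
    """Split a 'min default max' sysctl string into named integers."""
    minimum, default, maximum = (int(field) for field in value.split())
    return {"min": minimum, "default": default, "max": maximum}


def max_covers_bdp(value, bandwidth_bits_per_s, rtt_s):
    """True if the maximum buffer is at least one bandwidth-delay product."""
    bdp_bytes = bandwidth_bits_per_s / 8 * rtt_s
    return parse_tcp_mem(value)["max"] >= bdp_bytes


rmem = "4096 87380 174760"  # the tcp_rmem value shown above
print(parse_tcp_mem(rmem))
print(max_covers_bdp(rmem, 10e6, 0.100))  # 10Mb/s path, 100ms RTT
print(max_covers_bdp(rmem, 1e9, 0.100))   # 1Gb/s path, 100ms RTT
```

Under these assumptions the default maximum is adequate for the slower path but far too small for the 1Gb/s path, which is why the limits should be raised in high bandwidth-delay-product environments.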

15.5.4.1 Example 

To demonstrate the behavior of receiver auto-tuning, we use a Windows XP sender
(set to use large windows and window scaling) and a Linux 2.6.11 receiver that
includes auto-tuning. At the sender, we issue the following command:

C:\> sock -n 512 -i 10.0.0.1 6666 

At the receiver, we do not specify any setting for the receive buffer, but we do
arrange for an initial delay of 20s before the application performs any reads:


Linux% sock -i -s -v -P 20 6666 


To illustrate the growth of the receiver's advertised window, we can use
Wireshark to sort the displayed packets based on the receiver's address (see Figure 15-15).
During connection establishment, the receiver begins with an initial window size
of 1460 bytes and an initial MSS of 1412 bytes. It is using window scaling, with a
shift amount of 2 (not shown), leading to a maximum usable window of 256KB.
We can see that after the initial packets, the window increases, which corresponds
to the sender's increase in the data sending rate. We explore the sender's data rate
control when we investigate TCP congestion control in Chapter 16. For now, we
need only know that when the sender starts up, it typically starts by sending one
packet and then increases the amount of outstanding data by one MSS packet for
each ACK it receives that indicates progress. Thus, it typically sends two MSS-size
segments for each ACK it receives.

Figure 15-15 The Linux receiver performs receiver-side auto-tuning by increasing the window as more data is
received. Because the application does not read for 20s, the window eventually closes.

Looking at the pattern of the window advertisements (10712, 13536, 16360,
19184, ...), we can see that the advertised window is increased by twice the MSS
on each ACK, which mimics the way the sender's congestion control scheme
operates, as we shall see in Chapter 16. Provided enough memory is available at the
receiver, the advertised window is always larger than what the sender is
permitted to send according to its congestion control limitations. This is the best case:
the minimal amount of buffer space is being used and advertised by the receiver
that keeps the sender sending as fast as possible.
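
The growth step is easy to confirm from the advertised values. The following Python lines use only the four advertisements and the 1412-byte MSS quoted above.

```python
advertised = [10712, 13536, 16360, 19184]  # window advertisements from the trace
mss = 1412                                  # MSS negotiated on this connection

# Difference between each advertisement and the previous one.
steps = [later - earlier for earlier, later in zip(advertised, advertised[1:])]

print(steps)
print([step / mss for step in steps])  # growth expressed in units of the MSS
```

Every step is 2824 bytes, exactly twice the MSS.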

If the receiver exhausts its buffers, auto-tuning is compromised. In this
example, by time 0.678 the pattern of window growth reverses, having achieved
a maximum of 33,304 bytes. The window size is no longer increasing, but instead
the buffer is filling up while the application pauses. When the application begins
reading at time 20, the window size again increases and goes beyond the point
where it was previously (see Figure 15-16).

Figure 15-16 With the application pausing before reading, auto-tuning is compromised because the receive
buffer becomes full. As the application begins reading, the advertised window increases,
exceeding its previous value.


The zero window advertisement (packet 117) forces the sender to perform a
series of window probes, resulting in a series of zero window advertisements.
After the application begins reading at time 20.043, a window update is sent to
the sender. The window begins to grow once again, twice the MSS in bytes for
each ACK. As the sender continues to send additional data and the receiver
consumes it, the receiver continues to increase the advertised window until the value
67808 is reached, which is the largest value the receiver ever advertises on this
connection. This version of Linux also measures the time between adjacent
application read completions and compares this value against the estimated connection
round-trip time. If the RTT estimate increases, the buffer size is also increased
(it is not decreased if the RTT becomes smaller). This helps auto-tuning keep the
receiver's advertised window ahead of the sender's window even when the
connection's bandwidth-delay product is increasing.

The problem of TCP applications using too-small buffers became a significant
one as faster wide area Internet connections became available. In the United
States, with cross-country round-trip times of approximately 100ms, using a 64KB
window over a 1Gb/s network limits TCP throughput to about 640KB/s instead
of the calculated maximum of about 130MB/s (a 99% waste of bandwidth).
Practically speaking, it is not uncommon to see a factor of 100 increase in throughput
performance when moving from a TCP with limited buffers to one with larger
buffers on such networks. Significant credit should be given to the Web100 project
[W100]. It created a set of tools and software patches in an effort to maximize the
available throughput performance an application can obtain from various TCP
implementations.
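
The arithmetic behind these figures is that a sender can have at most one window of data outstanding per round trip. The Python sketch below reproduces it; the function names are ours, and the raw link rate of 1Gb/s works out to about 125MB/s before any protocol overhead, which the text rounds to about 130MB/s.

```python
def window_limited_rate(window_bytes, rtt_s):
    """Highest throughput (bytes/s) with one window outstanding per RTT."""
    return window_bytes / rtt_s


def bandwidth_delay_product(link_bits_per_s, rtt_s):
    """Bytes that must be in flight to keep the path full."""
    return link_bits_per_s / 8 * rtt_s


rtt = 0.100                                   # cross-country RTT, seconds
rate = window_limited_rate(64 * 1024, rtt)    # 64KB window
link_rate = 1e9 / 8                           # 1Gb/s expressed in bytes/s

print(rate / 1024)                            # window-limited rate in KB/s
print(link_rate / 1e6)                        # link rate in MB/s
print(rate / link_rate)                       # fraction of the link in use
print(bandwidth_delay_product(1e9, rtt))      # window needed to fill the path
```

The window-limited rate is about 640KB/s, well under 1% of the link, and a window of roughly 12.5MB would be needed to fill this path.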


15.6 Urgent Mechanism 

We saw in Chapter 12 that the TCP header has a special URG bit field to indicate
"urgent data." An application is able to mark data as urgent by specifying a special
option to the Berkeley sockets API (MSG_OOB) when it performs a write
operation, although the use of urgent data is no longer recommended [RFC6093]. When
the sender's TCP receives such a write request, it enters a special state called urgent
mode. Upon entering urgent mode, it records the last byte the application specified
as urgent data. This is used to set the Urgent Pointer field in each subsequent TCP
header the sender generates until the application ceases writing urgent data and
all the sequence numbers up to the urgent pointer have been acknowledged by
the receiver. According to [RFC6093], the urgent pointer points to the sequence
number of the byte of data following the last byte of urgent data. This resolves a
longstanding ambiguity in various RFCs that included contradictory statements
about the semantics of the Urgent Pointer field. When an IPv6 jumbogram is used,
the Urgent Pointer value of 65535 may be used to indicate the end of urgent data is
to be found at the end of the TCP data area [RFC2675], beyond the 64K byte offset
expressible using the conventional 16-bit Urgent Pointer field.

A receiving TCP enters urgent mode when it receives a segment with the URG
bit field set. The receiving application can discover whether its TCP has entered
urgent mode using a standard socket API call (select()). The operation of the
urgent mechanism has been a source of confusion because the Berkeley sockets
API and documentation use the term out-of-band (OOB) data, although in
reality TCP does not implement any true OOB capability. Instead, virtually all TCP
implementations deliver the last byte of urgent data to an application using a
distinct API call parameter at the receiver. The receiver must specify either the
MSG_OOB option to retrieve the special byte or specify MSG_OOBINLINE to have the
special byte remain in the regular data stream (this is now the required method,
assuming the urgent mechanism is used at all).

15.6.1 Example 

To get a better understanding of the urgent mechanism, we use a Mac OS X sender
and Linux receiver to show how urgent mode behaves, including what happens
during a zero window event. To achieve this, we first limit receive window
auto-tuning on the Linux receiver:


Linux# sysctl -w net.ipv4.tcp_rmem='4096 4096 174760'
Linux% sock -i -v -s -p 1 -P 10 5555 


The first command ensures that any receive window automatic adjustment
does not exceed 4KB. This will be useful to us in order to see what happens when
the window closes. The second command invokes the server and instructs it to
wait 10s before performing any reads, and to wait 1s between each read operation
it does perform. At the client, we execute the following command:


Mac% sock -i -n 7 -U 7 -p 1 -S 8192 10.0.1.1 5555
SO_SNDBUF = 8192
connected on 10.0.1.33.51101 to 10.0.1.1.5555
TCP_MAXSEG = 1448
wrote 1024 bytes
wrote 1024 bytes
wrote 1024 bytes
wrote 1024 bytes
wrote 1024 bytes
wrote 1024 bytes
wrote 1 byte of urgent data
wrote 1024 bytes


This command creates a client that performs seven 1024-byte writes spaced 1s
apart but also performs a write of 1 byte of urgent data prior to the last write. The
client's buffer is sufficiently large (set to 8192 bytes) that this application completes
execution immediately because all the data it sends is buffered by the sending TCP.

In Figure 15-17, we can see how the initial right window edge advertised by
the receiver is 2800 and is quickly increased to 5121. At time 1.0 the application
performs a write, and the right window edge advances to about 6145. From then on
the receiver's window increases no more because auto-tuning has been effectively
disabled above 4192 bytes and the receiving application has not performed any
reads. Until time 10.0, the sender probes the receiver but no additional window
growth occurs. Finally, when the receiver starts performing read operations after
time 10.0, the window opens and the sender completes the transfer. The packets
exchanged are shown in Figure 15-18.

Figure 15-17 After six write operations, the receiver's window has not advanced. The sending TCP stops
transmitting until the window opens at time 10.

The "exit point" for urgent mode is defined to be the sum of the Sequence
Number field and the Urgent Pointer field in a TCP segment. Only one urgent "point" (a
sequence number offset) is maintained per TCP connection, so a packet arriving
with a valid Urgent Pointer field causes the information contained in any previous
urgent pointer to be lost. Segment 16 is the first segment containing a valid urgent
pointer, resulting in an exit point relative sequence number of 6146. Note that this
sequence number may not be contained in the segment providing the indication
but could instead be in some later segment. This is the case with segment 17, for
example, which contains no data but includes the urgent pointer (with value 1).

Figure 15-18 The entire data transfer showing a zero window advertisement from the receiver at time 5.012.
When the application performs its next writes, the sending TCP enters urgent mode, resulting
in the URG bit being set starting at time 6.0113 on a window probe segment containing one
sequence number. At time 7 the application performs its final write and closes, producing two
empty segments. A window update at time 10.006 restarts the data transfer. A zero window
advertisement at time 10.009 again stops the transfer but also indicates that urgent mode can
now be exited because the urgent pointer has been acknowledged. The FIN at time 11.007
contains the final data byte.

As mentioned before, there has been some historical confusion about whether
the exit point indicates the last byte of urgent data or the following first byte of
nonurgent data. In [RFC1122], the pointer is declared to point to the last byte of
urgent data. However, essentially all TCP implementations do not follow this
specification, so [RFC6093] recognizes this fact and changes various specifications
to make the pointer indicate the first byte of nonurgent data. In this example, the
byte with sequence number 6145 contains the 1 byte of urgent data produced by
the sock client, but in all the segments we have seen the urgent pointer has a
value of 1 when the sequence number field is 6145. Consequently, we can see that
with this implementation of TCP, as with most, the exit point is the sequence
number of the first byte of nonurgent data.
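
The two interpretations can be compared with a few lines of Python. The helper names and the `rfc6093` flag are assumptions made for illustration; the modulus reflects TCP's 32-bit sequence space.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers wrap at 32 bits


def exit_point(seq, urgent_pointer):
    """Sum of the Sequence Number and Urgent Pointer fields."""
    return (seq + urgent_pointer) % SEQ_SPACE


def last_urgent_byte(seq, urgent_pointer, rfc6093=True):
    """Sequence number of the last urgent byte under either reading."""
    point = exit_point(seq, urgent_pointer)
    # [RFC6093]: the exit point is the first byte of nonurgent data.
    # [RFC1122]: the exit point is itself the last byte of urgent data.
    return (point - 1) % SEQ_SPACE if rfc6093 else point


print(exit_point(6145, 1))
print(last_urgent_byte(6145, 1, rfc6093=True))
print(last_urgent_byte(6145, 1, rfc6093=False))
```

Only the [RFC6093] reading places the urgent byte at sequence number 6145, which matches what the sock client actually wrote.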

As we can see from this example, TCP carries urgent data inline with the data
stream (not "out of band"). If an application really wants a separate out-of-band
channel, a second TCP connection is the easiest way to accomplish this. (Some
transport-layer protocols do provide what most people consider OOB data: a
logically separate data path using the same connection as the normal data path. This
is not what TCP provides.)


15.7 Attacks Involving Window Management

The window management procedures for TCP have been the subject of various
attacks, primarily forms of resource exhaustion. In essence, advertising a small
window slows a TCP transfer, tying up resources such as memory for a potentially
long time. This has been used as a form of attack on bad traffic (i.e., worms). The
LaBrea tarpit [L01], for example, arranges to complete the TCP three-way
handshake and then either does nothing or produces minimal responses that simply
cause the sending TCP to continually slow down. This keeps the sending TCP
busy and essentially slows down worm propagation. Tarpits are thus attacks on
attacking traffic.

A more recent attack was published in 2009 [I09], based on a known
vulnerability of the persist timer. It uses a client-side variety of the "SYN cookies"
technique (see Chapter 13). All the necessary connection state can thus be offloaded
onto the victim machine, minimizing the amount of resources consumed at the
attacker's machine. The attack itself is similar to the LaBrea idea, except it focuses
specifically on the persist timer. Multiple such attacks can be mounted on the
same server, which can lead to resource exhaustion (e.g., running out of system
memory). The "solution" to this attack, as suggested by [C723308], is to allow some
other process to terminate TCP connections when resource exhaustion appears to
be taking place.


15.8 Summary 

Interactive data is normally transmitted in segments smaller than the SMSS. 
Delayed acknowledgments may be used by the receiver of these small segments 
to see if the acknowledgment can be piggybacked along with data going back to 
the sender. This often reduces the number of segments, especially for interactive 
traffic, where the server is echoing the characters typed at the client. However, it 
may introduce additional delay. 

On connections with relatively large round-trip times, such as WANs, the
Nagle algorithm is often used to reduce the number of small segments. This
algorithm limits the sender to a single small packet of unacknowledged data at any
time. While this can reduce the number of high-overhead small packets in the
network and reduce the total number of packets carried during a connection, it
adds delay that is sometimes unacceptable to applications. In addition, the
interaction between delayed ACKs and the Nagle algorithm can lead to an undesirable
form of temporary deadlock. Because of these issues, the Nagle algorithm can be
disabled by applications, and most interactive applications take advantage of this
capability.

TCP implements flow control by including a window advertisement on every
ACK it sends. Such window advertisements signal the peer TCP how much
buffer space is left at the endpoint that sent the window advertisement ACK. The
maximum window advertisement is 65,535 bytes unless the Window Scale TCP
option is used. In that case, the maximum window advertisement can be much
larger (about 1GB).
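
These limits follow from shifting the 16-bit Window Size field left by the negotiated scale factor. The Python sketch below assumes the largest shift the Window Scale option permits is 14; the function name is ours.

```python
MAX_SHIFT = 14  # largest shift count allowed by the Window Scale option


def max_window(shift):
    """Largest window expressible with a 16-bit field and a given shift."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift must be between 0 and 14")
    return 65535 << shift


print(max_window(0))   # no window scaling
print(max_window(2))   # the shift of 2 used in the auto-tuning example
print(max_window(14))  # the largest possible window
```

A shift of 2 gives 262,140 bytes, the "256KB" usable window noted in the auto-tuning example, and a shift of 14 gives 1,073,725,440 bytes, or about 1GB.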

The window advertisement can be as small as 0 bytes, indicating that the
receiver is completely full. When this happens, the sender stops sending data
and instead begins probing the closed window using a retransmission interval
with a backoff scheme similar to timer-based retransmissions (see Chapter 14).
This probing of the closed window continues indefinitely, until either an ACK is
returned indicating a larger window or the receiver sends an unsolicited window
advertisement (a window update) because buffer space has become available. This
indefinite behavior has been used to create a resource exhaustion attack against
TCP.

During the development of TCP, a curious phenomenon was observed. When
a small window was advertised, the sender would immediately fill it. This
behavior, which causes the connection to use a large number of high-overhead small
packets, would continue until the connection became idle and was dubbed "silly
window syndrome." Techniques were created to avoid it, applying to both the
TCP send and receive logic. The sender avoids sending small segments when a
small window is advertised; receivers try to avoid ever sending small window
advertisements.

The size of the receiver's window is limited by the size of the receiver's buffer.
Historically, applications that failed to specify their receive buffers would be
allocated a relatively small buffer that would cause throughput performance to suffer
over network paths with high bandwidth and high delay. In more recent
operating systems, auto-tuning sets the buffer size allocated automatically in an efficient
way, causing such concerns to largely be a thing of the past.


16.1 Introduction

In this chapter we investigate how TCP approaches the issue of congestion control,
which is most important in the context of bulk data transfers. Congestion control
is a set of behaviors determined by algorithms that each TCP implements in an
attempt to prevent the network from being overwhelmed by too large an
aggregate offered traffic load. The basic approach is to have TCP slow down when it has
reason to believe the network is about to be congested (or is already so congested
that routers are discarding packets). The challenge is to determine exactly when
and how TCP should slow down, and when it can speed up again.

TCP is a protocol designed to provide reliable delivery of data from one system
to another. We have already seen in Chapter 15 how a sending TCP can be made
to slow down if its peer (receiving) TCP cannot keep up. This is accomplished by
TCP's procedures for flow control and is realized by a sender adapting its sending
rate based on the advertised Window Size field provided by a receiver in its ACKs.
This provides explicit information about the state of the receiver back to the sender
and allows it to avoid overrunning the receiver's buffers.

Consider what happens when the network between a collection of senders and
receivers is asked to carry more traffic than it can handle. Either the senders must
slow down or the network must ultimately throw some data away (or some
combination thereof). This fact arises from the most basic observation from queuing
theory as applied at a router: even if the router can store some data, if the long-term data
arrival rate exceeds the long-term departure rate, any amount of intermediate
storage will grow without bound. Stated more simply, if a router receives more data per
unit time than it can send out, it must store that data. If this situation persists,
eventually the storage will run out and the router will be forced to drop some of the data.
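
The observation can be illustrated with a toy Python simulation of a single router queue. This is a deliberately simplified model with made-up rates and capacity (all in packets per time step), not a description of any real router.

```python
def simulate_queue(arrival_rate, departure_rate, capacity, steps):
    """Return (final queue length, total packets dropped) after `steps`."""
    queue = 0
    dropped = 0
    for _ in range(steps):
        queue += arrival_rate                # packets arriving this step
        queue -= min(queue, departure_rate)  # packets forwarded this step
        if queue > capacity:                 # storage exhausted: drop excess
            dropped += queue - capacity
            queue = capacity
    return queue, dropped


# Arrivals exceed departures: the queue fills, then drops begin.
print(simulate_queue(arrival_rate=12, departure_rate=10, capacity=50, steps=100))
# Departures keep up with arrivals: the queue stays empty.
print(simulate_queue(arrival_rate=8, departure_rate=10, capacity=50, steps=100))
```

In the first case the queue grows by 2 packets per step until the 50-packet storage is full, after which 2 packets are dropped on every step; making the storage larger only delays the point at which drops begin.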

This situation, when a router is forced to discard data because it cannot handle
the arriving traffic rate, is called congestion. The router is said to be congested when
it is in this state, and even a single connection can drive one or more routers into
congestion. Left unaddressed, congestion can cause the performance of a network
to be reduced so badly that it becomes unusable. In the very worst cases, it is said
to be in a state of congestion collapse. To either avoid or at least react effectively to
mitigate this situation, each TCP implements congestion control procedures.
Different versions or variants of TCP (and the operating systems that host the TCP/IP
stack) have somewhat different procedures and behaviors. We will discuss most
of the better-known ones in this chapter.

16.1.1 Detection of Congestion in TCP 

As we have seen, the primary mechanism TCP has available to combat packet loss
is retransmission, induced either by a retransmission timer expiring, or by the fast
retransmit algorithm (see Chapter 14). Consider, for a moment, the consequence
of many TCP connections that share an Internet path simply retransmitting more
packets while the network is in a state of congestion collapse. As you can imagine,
this only makes the situation worse. It has been called the analog of pouring
gasoline on a fire and is something to be avoided.

In order to deal with congestion, we would like to have sending TCPs slow
down when congestion is present (or about to be) and, if the congestion has
subsided, detect and use an appropriate amount of new bandwidth when it becomes
available. In the Internet, this can be quite challenging, as there has traditionally
been no explicit way for a sending TCP to learn about the state of the intermediate
routers. In other words, there is no explicit signaling about congestion. Instead, if
a typical TCP is to react somehow to congestion, it must first conclude that
congestion is occurring. This is usually accomplished by detecting that one or more
packets have been lost. In TCP, an assumption is made that a lost packet is an
indicator of congestion, and that some response (i.e., slowing down in some way) is
required. We shall see that TCP has been this way since the late 1980s. Other
methods for detecting congestion, including measuring delay and network-supported
Explicit Congestion Notification (ECN), which we discuss in Section 16.11, allow TCP
to learn about congestion even before it has become so bad as to cause dropped
packets. We discuss these approaches after studying the "classic" algorithms.


Note 

In today’s wired networks, packet loss is caused primarily by congestion in routers 
or switches. With wireless networks, transmission and reception errors become a 
significant cause of packet loss. Determining whether loss is due to congestion or
transmission errors has been an active research topic since the mid-1990s when 
wireless networks started to attain widespread use. 


In Chapter 14 we saw how TCP can use timers, acknowledgments, and selective
acknowledgments to detect and recover from dropped packets. When packets
are detected as lost, it is TCP's responsibility to resend them. We are now concerned
with what else TCP does when it observes a lost packet. In particular, we are interested
in how it interprets this as a signal that congestion has occurred, and that it
should slow down. Just how it slows down and when (and how it speeds back up
again) are the main subjects of the following sections. We begin with the classic
algorithm used on a new connection to establish the base data rate and continue
with another classic algorithm that is used by TCP during its steady-state operation
when performing large data transfers. We will also incorporate the recommended
variations on these algorithms into the discussion and discuss other modifications
that have been made over the years. We will also examine an extended trace
in detail. We conclude with a discussion of some of the security issues related to
TCP congestion control and summarize the most important points. The area of
congestion control has been a fertile area for networking researchers [RFC6077],
and several new papers on this subject tend to appear each year.

16.1.2 Slowing Down a TCP Sender 

One detail we need to address right away is just how to slow down a TCP sender.
We saw in Chapter 15 that the Window Size field in the TCP header is used to signal
a sender to adjust its window based on the availability of buffer space at the
receiver. We can go a step further and arrange for the sender to slow down if either
the receiver is too slow or the network is too slow. This is accomplished by introducing
a window control variable at the sender that is based on an estimate of the
network's capacity and ensuring that the sender's window size never exceeds the
minimum of the two. In effect, a sending TCP then sends at a rate equal to what
the receiver or the network can handle, whichever is less.

The new value used to hold the estimate of the network's available capacity is
called the congestion window, written more compactly as simply cwnd. The sender's
actual (usable) window W is then written as the minimum of the receiver's advertised
window awnd and the congestion window:

W = min(cwnd, awnd)

With this relationship, the TCP sender is not permitted to have more than W
unacknowledged packets or bytes outstanding in the network. The total amount
of data a sender has introduced into the network for which it has not yet received
an acknowledgment is sometimes called the flight size, which is always less than or
equal to W. In general, W can be maintained in either packet or byte units.
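
The window relationship can be sketched in a few lines of Python. This is only an illustration: the function names usable_window and can_send are hypothetical, and all quantities are assumed to be kept in bytes.

```python
def usable_window(cwnd: int, awnd: int) -> int:
    """Return W = min(cwnd, awnd); all values are in bytes."""
    return min(cwnd, awnd)


def can_send(flight_size: int, segment_len: int, cwnd: int, awnd: int) -> bool:
    """True if sending segment_len more bytes keeps the flight size within W."""
    return flight_size + segment_len <= usable_window(cwnd, awnd)


# The network estimate (cwnd) is the tighter limit in this example.
print(usable_window(8000, 65535))         # limited by cwnd
print(can_send(6000, 1460, 8000, 65535))  # 7460 bytes in flight would fit
print(can_send(7000, 1460, 8000, 65535))  # 8460 bytes in flight would not
```

Whichever of the two limits is smaller governs the sender, which is the sense in which the sender transmits at the rate the receiver or the network can handle, whichever is less.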


Note 

When TCP does not make use of selective acknowledgment, the restriction on
W means that the sender is not permitted to send a segment with a sequence
number greater than the sum of the highest acknowledged sequence number and
the value of W. A SACK TCP sender treats W somewhat differently, using it as an
overall limit to the flight size.

This all seems logical but is far from the whole story. Because both the state
of the network and the state of the receiver change with time, the values of both
awnd and cwnd change over time. In addition, because of the lack of explicit signals
(see the preceding section), the "correct" value of cwnd is generally not directly
available to the sending TCP. Thus, all of the values W, cwnd, and awnd must be
empirically determined and dynamically updated. In addition, as we said before,
we do not want W to be too big or too small; we want it to be set to about the
bandwidth-delay product (BDP) of the network path, also called the optimal window
size. This is the amount of data that can be stored in the network in transit to the
receiver. It is equal to the product of the RTT and the capacity of the lowest-capacity
("bottleneck") link on the path from sender to receiver. Generally, the sending
strategy is to keep the network busy by arranging to have an amount of data at
least as large as the BDP in the network. Using an outstanding limit that substantially
exceeds the BDP, however, is usually undesirable as it can lead to unwanted
delays (see Section 16.10). On the Internet, determining the BDP for a connection
can be challenging, given that routes, delay, and the level of statistical multiplexing
(i.e., sharing of capacity) change as a function of time.
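
As a worked illustration of the BDP, the sketch below computes it for an assumed path. The figures (a 100 Mb/s bottleneck, a 50 ms RTT, and an SMSS of 1460 bytes) are illustrative assumptions, and the function names are hypothetical.

```python
import math


def bdp_bytes(bottleneck_bps: int, rtt_ms: int) -> int:
    """BDP in bytes: bottleneck capacity (bits/s) times RTT (ms), as bytes."""
    # bits/s * ms = millibits; divide by 1000 for bits and by 8 for bytes.
    return bottleneck_bps * rtt_ms // 8000


def bdp_segments(bottleneck_bps: int, rtt_ms: int, smss: int) -> int:
    """Number of SMSS-size segments needed to cover the BDP (rounded up)."""
    return math.ceil(bdp_bytes(bottleneck_bps, rtt_ms) / smss)


print(bdp_bytes(100_000_000, 50))           # 625000 bytes
print(bdp_segments(100_000_000, 50, 1460))  # 429 segments
```

In practice neither input is known exactly or constant, which is why a TCP has to estimate a suitable window empirically rather than compute it this way.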


Note 

Although handling congestion at the TCP sender is our primary area of interest,
work has been done on handling the cases where congestion occurs on the
reverse path, because of ACKs. In [RFC5690] a method is introduced to inform
a TCP receiver of the ACK ratio it should use (i.e., how many packets it should
receive before sending an ACK).


16.2 The Classic Algorithms 

When a new TCP connection first starts out, it usually has no idea what the initial 
value for cwnd should be, as it has no idea how much network capacity is available 
for it to send its data. (There are some exceptions, such as systems that cache performance
values that were determined earlier. These were called destination metrics
in Chapter 14.) TCP learns the value for awnd with one packet exchange to the
receiver, but without any explicit signaling, the only obvious way it has to learn a
good value for cwnd is to try sending data at faster and faster rates until it experiences
a packet drop (or other congestion indicator). This could be accomplished
by either sending immediately at the maximum rate it can (subject to the value 
of awnd), or it could start more slowly. Because of the detrimental effects on the 
performance of other TCP connections sharing the same network path that could 
be experienced when starting at full rate, a TCP generally uses one algorithm to 
avoid starting so fast when it starts up to get to steady state. It uses a different one 
once it is in steady state. 

The operation of TCP congestion control at a sender is driven or "clocked" by 
the receipt of ACKs. If a TCP is operating at steady state (with an appropriate value 
of cwnd), receipt of an ACK indicates that one or more packets have been removed
from the network, and consequently that an opportunity to send more has arisen. 
Following this line of reasoning, the TCP congestion behavior in steady state 
attempts to achieve a conservation of packets in the network (see Figure 16-1). The 
term conservation here is used in the sense it is in physics—that some quantity 
(e.g., momentum, energy) going into a system does not simply disappear or appear 
but rather can be found as long as proper accounting is performed. 



Figure 16-1 TCP congestion control operates on a principle of conservation of packets. Packets (P_b)
are "stretched out" in time as they are sent from sender to receiver over links with constrained
capacity. As they are received at the receiver spaced apart (P_r), ACKs are generated
(A_r), which return to the sender. ACKs traveling from receiver to sender become
spaced out (A_b) in relation to the inter-packet spacing of the packets. When ACKs reach
the sender (A_s), their arrivals provide a signal or "ACK clock," used to tell the sender it
is time to send more. In steady state, the overall system is said to be "self-clocked." The
figure is adapted from [J88] and copied from S. Seshan's CMU Lecture Notes dated March 22, 2005.


This idea is illustrated in Figure 16-1. We shall call the top and bottom objects 
"funnels." The top funnel holds (larger) data packets traveling along the path from 
the sender to the receiver. The comparatively narrow width of the funnel depicts 
how packets are "stretched out" in time as they travel through a relatively slow 
link. The ends of the funnels (at sender and receiver) show the queues where packets
are held before or after they travel along the path. The bottom funnel holds the
ACKs sent by the receiver back to the sender that correspond to the data packets 
in the top funnel. When operating efficiently at steady state, there are no bunches 
of packets in the top or bottom funnels. In addition, there is no significant extra 
space between packets in the top funnel. Note that an arrival of an ACK at the 
sender "liberates" another data packet to be sent into the top funnel, and that 
this happens at just the right time (i.e., when the network is able to accept another 
packet). This relationship is sometimes called self-clocking, because the arrival of 
an ACK, called the ACK clock, triggers the system to take the action of sending 
another packet. 

We now turn to the main two algorithms of TCP: slow start and congestion 
avoidance. These algorithms, based on the principles of packet conservation and 
ACK clocking, were first formally described in the classic paper by Jacobson [J88]. 
An update to the congestion avoidance algorithm was given by Jacobson a couple
of years later [J90]. These algorithms do not operate at the same time; TCP executes
only one at any given time, but it may switch back and forth between the
two. We now explore these in more detail and examine what determines when
each of them is used. We also look at how they have been modified and extended
since they were initially implemented. Each TCP connection is able to individually
execute these algorithms.

16.2.1 Slow Start 

The slow start algorithm is executed when a new TCP connection is created or 
when a loss has been detected due to a retransmission timeout (RTO). It may also 
be invoked after a sending TCP has gone idle for some time. The purpose of slow 
start is to help TCP find a value for cwnd before probing for more available bandwidth
using congestion avoidance and to establish the ACK clock. Typically, a
TCP begins a new connection in slow start, eventually drops a packet, and then
settles into steady-state operation using the congestion avoidance algorithm (Section
16.2.2). To quote from [RFC5681]:

Beginning transmission into a network with unknown conditions requires TCP 
to slowly probe the network to determine the available capacity, in order to avoid 
congesting the network with an inappropriately large burst of data. The slow start 
algorithm is used for this purpose at the beginning of a transfer, or after repairing 
loss detected by the retransmission timer. 

A TCP begins in slow start by sending a certain number of segments (after 
the SYN exchange), called the initial window (IW). The value of IW was originally 
one SMSS, although with [RFC5681] it is allowed to be larger. The formula works 
as follows: 

IW = 2*(SMSS) and not more than 2 segments (if SMSS > 2190 bytes)

IW = 3*(SMSS) and not more than 3 segments (if 2190 ≥ SMSS > 1095 bytes)

IW = 4*(SMSS) and not more than 4 segments (otherwise)

While this assignment for IW may allow several packets (e.g., three or four) 
in the initial window, we shall discuss the case where IW = 1 SMSS for simplicity. 
A TCP just starting out begins its connection, then, with cwnd = 1 SMSS, meaning 
the initial usable window W is also equal to SMSS. Note that in most cases SMSS 
is equal to the smaller of the receiver's MSS and the path MTU (less header sizes). 
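
The initial window rule quoted from [RFC5681] above can be written as a small function. This is a minimal sketch; the name initial_window is hypothetical and the result is expressed in bytes.

```python
def initial_window(smss: int) -> int:
    """Upper bound on IW in bytes, following the three cases of [RFC5681]."""
    if smss > 2190:
        return 2 * smss   # at most 2 segments
    if smss > 1095:
        return 3 * smss   # at most 3 segments
    return 4 * smss       # at most 4 segments


# A common Ethernet-derived SMSS of 1460 bytes falls in the middle case.
print(initial_window(1460))  # 4380
print(initial_window(536))   # 2144
print(initial_window(4000))  # 8000
```

For the common SMSS values of 536 and 1460 bytes this permits four and three segments respectively in the first window.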

Assuming no packets are lost and each packet causes an ACK to be sent in 
response, an ACK is returned for the first segment, allowing the sending TCP 
to send another segment. However, slow start operates by incrementing cwnd by 
min(N, SMSS) for each good ACK received, where N is the number of previously 
unacknowledged bytes ACKed by the received "good ACK." A good ACK is one
that returns a higher ACK number than has been seen so far. 
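
The per-ACK increment can be sketched as follows. The function name slow_start_on_ack is hypothetical, cwnd is kept in bytes, and the ACK sizes in the example are illustrative only.

```python
def slow_start_on_ack(cwnd: int, bytes_acked: int, smss: int) -> int:
    """Grow cwnd by min(N, SMSS) for a good ACK covering N new bytes."""
    return cwnd + min(bytes_acked, smss)


SMSS = 1460
cwnd = SMSS

# One segment acknowledged in three small pieces: the total growth is still
# only one SMSS, not three.
for n in (500, 500, 460):
    cwnd = slow_start_on_ack(cwnd, n, SMSS)
print(cwnd)  # 2920

# A single ACK covering two full segments is capped at one SMSS of growth.
print(slow_start_on_ack(cwnd, 2 * SMSS, SMSS))  # 4380
```

Counting bytes rather than ACKs is what keeps a receiver from inflating the sender's window simply by splitting its acknowledgments.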


Note 

The number of bytes ACKed is used to support Appropriate Byte Counting (ABC) 
[RFC3465], an experimental specification recommended by [RFC5681]. It can be 
used to counter an “ACK division” attack, described in Section 16.12, where many 
small ACKs are used in an attempt to cause a TCP sender to send faster than normal.
Linux uses the Boolean system configuration variable net.ipv4.tcp_abc
to determine if ABC is enabled (default no). In recent versions of Windows, ABC 
defaults to on. 


Thus, after one segment is ACKed, the cwnd value is ordinarily increased to 2,
and two segments are sent. If each of those causes new good ACKs to be returned,
2 increases to 4, 4 to 8, and so on. In general, assuming no loss and an ACK for
every packet, the value of W after k round-trip exchanges is W = 2^k. Rewriting,
we can say that k = log_2 W RTTs are required to reach an operating window of W.
This growth seems quite "fast" (increasing as an exponential function) but is still
"slower" than what TCP would do if it were allowed to send immediately a window
of packets equal in size to the receiver's advertised window. (Recall that W is
still never allowed to exceed awnd.)
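
To make the relationship k = log_2 W concrete, the following sketch counts the round trips needed to reach a target window. It assumes the idealized case described above (an initial window of one segment, no loss, one ACK per packet); the function name is hypothetical.

```python
def rounds_to_reach(target_segments: int) -> int:
    """Smallest k with 2**k >= target_segments (window measured in segments)."""
    if target_segments < 1:
        raise ValueError("target window must be at least one segment")
    # For an integer w >= 1, (w - 1).bit_length() equals ceil(log2(w)).
    return (target_segments - 1).bit_length()


print(rounds_to_reach(64))   # 6 RTTs: 1, 2, 4, 8, 16, 32, 64
print(rounds_to_reach(100))  # 7 RTTs, since 2**6 = 64 < 100 <= 128 = 2**7
```

Even a window of several hundred segments is therefore reached in fewer than ten round trips, provided nothing is lost along the way.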

If we imagine a TCP connection where the receiver's advertised window is
very large (say, infinitely large), cwnd is the primary governor of the sending rate
(provided there is something for the sender to send). As we saw, this value grows
exponentially fast in the RTT of the connection. So, eventually, cwnd (and thus
W) could become so large that the corresponding window of packets sent overwhelms
the network (recall that TCP's throughput rate is proportional to W/RTT).
When this happens, cwnd is reduced substantially (to half of its former value). In
addition, this is the point at which TCP switches from operating in slow start to
operating in congestion avoidance. The switch point is determined by the relationship
between cwnd and a value called the slow start threshold (or ssthresh).

Figure 16-2 (left) illustrates the operation of slow start. The numbers are in
units of the RTT of the connection. Assuming the connection starts out with one
packet (top), one ACK is returned, allowing two packets to be sent during the
second RTT. These packets cause two ACKs to be returned. The TCP sender increments
cwnd by one segment for each ACK returned, so the process continues. The
exponential growth of cwnd as a function of time is illustrated on the right. The
second line shows how cwnd grows when every other packet is acknowledged,
which is common when delayed ACKs are being used. In this case, the growth
is still exponential but not as rapid. For this reason, some TCPs arrange to delay
ACKs only after the connection has completed slow start. In Linux, this is called
quick acknowledgments ("quickack mode") and has been part of the basic TCP/IP
stack since kernel version 2.4.4.
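
The two growth curves can be approximated with a short simulation. This is a rough sketch under stated assumptions: cwnd is counted in whole segments, one ACK adds one segment, and with delayed ACKs the receiver acknowledges every second segment, with a lone leftover segment still acknowledged within the same round. The function name and the ack_every parameter are hypothetical.

```python
def slow_start_trace(rtts: int, ack_every: int = 1) -> list[int]:
    """Return cwnd (in segments) at the start of each RTT during slow start."""
    cwnd = 1
    trace = [cwnd]
    for _ in range(rtts):
        # Ceiling division: a leftover segment is ACKed too (assumption).
        acks = -(-cwnd // ack_every)
        cwnd += acks  # one segment of growth per good ACK
        trace.append(cwnd)
    return trace


print(slow_start_trace(4))               # [1, 2, 4, 8, 16]
print(slow_start_trace(4, ack_every=2))  # [1, 2, 3, 5, 8]
```

Without delayed ACKs the window doubles each round; with an ACK for every other packet it grows by roughly a factor of 1.5 per round, which is still exponential but slower.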



[Plot for Figure 16-2: cwnd and cwnd (delack) versus RTT Number]

Figure 16-2 Operation of the classic slow start algorithm. In the simple case where ACKs are not delayed, 
every arriving good ACK allows the sender to inject two new packets (left). This leads to an exponential
growth in the size of the sender's window as a function of time (right, upper line). When
ACKs are delayed, such as when an ACK is produced for every other packet, the growth is still 
exponential but slower (right, lower line). 


16.2.2 Congestion Avoidance 

Slow start, just described, is used when initiating data flow across a connection or
after a loss event invoked by a timeout. It increases cwnd fairly rapidly and helps to
establish a value for ssthresh. Once this is achieved, there is always the possibility
that more network capacity may become available for a connection. If such capacity
were to be immediately used with large traffic bursts, other TCP connections
with packets sharing the same queues in routers would likely experience significant
packet drops, leading to overall instability in the network as many connections
simultaneously experience packet drops and react with retransmissions.

To address the problem of trying to find additional capacity that may become
available, but to not do so too aggressively, TCP implements the congestion avoidance
algorithm. Once ssthresh is established and cwnd is at least at this level, a
TCP runs the congestion avoidance algorithm, which seeks additional capacity by
increasing cwnd by approximately one segment for each window's worth of data
that is moved from sender to receiver successfully. This provides a much slower
growth rate than slow start: approximately linear in terms of time, as opposed to
slow start's exponential growth. More precisely, cwnd is usually updated as follows
for each received nonduplicate ACK:


cwnd_{t+1} = cwnd_t + SMSS * SMSS/cwnd_t

Looking at this relationship briefly, assume cwnd_0 = k*SMSS bytes were sent
into the network in k segments. After the first ACK arrives, cwnd is updated to be
larger by 1/k of a segment:

cwnd_1 = cwnd_0 + SMSS * SMSS/cwnd_0 = k*SMSS + SMSS * (SMSS/(k*SMSS)) =
k*SMSS + (1/k) * SMSS = (k + (1/k))*SMSS = cwnd_0 + (1/k)*SMSS

Because the value of cwnd grows slightly with each new ACK arrival, and this
value is in the denominator of the expression in the first equation above, the overall
growth rate of cwnd is slightly sublinear. Nonetheless, we generally think of congestion
avoidance growing the window linearly with respect to time (Figure 16-3),
whereas slow start grows it exponentially with respect to time (Figure 16-2). This
function is also called additive increase because a particular value (about one packet
in this case) is added to cwnd for each successfully received window's worth of data.
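
The slightly sublinear growth can be checked numerically. The sketch below applies the per-ACK update for one window's worth of ACKs; it assumes one ACK per segment and keeps cwnd as a floating-point byte count, and the function names are hypothetical.

```python
def ca_on_ack(cwnd: float, smss: int) -> float:
    """Congestion avoidance update for one nonduplicate ACK."""
    return cwnd + smss * smss / cwnd


def ca_one_rtt(cwnd: float, smss: int) -> float:
    """Apply the update once per segment in the window sent this RTT."""
    acks = int(cwnd // smss)  # one ACK per full segment (assumption)
    for _ in range(acks):
        cwnd = ca_on_ack(cwnd, smss)
    return cwnd


SMSS = 1460
start = 4 * SMSS                 # k = 4 segments
after = ca_one_rtt(start, SMSS)
print(ca_on_ack(start, SMSS) - start)  # 365.0, i.e., (1/k)*SMSS for k = 4
print((after - start) / SMSS)          # a little under 1 segment per RTT
```

The first ACK adds exactly SMSS/k bytes; each later ACK adds slightly less because cwnd has already grown, so the total over the round trip falls just short of one full segment.
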

Figure 16-3 Operation of the congestion avoidance algorithm. In the simple case where ACKs are not delayed,
every arriving good ACK allows the sender to inject approximately 1/W fraction of a new packet. 
This leads to approximately linear growth in the size of the sender's window as a function of time 
(right, upper line). When ACKs are delayed, such as when an ACK is produced for every other 
packet, the growth is still approximately linear but somewhat slower (right, lower line). 


Figure 16-3 (left) illustrates the operation of congestion avoidance. Once again, 
the numbers are in units of the RTT of the connection. Assuming the connection 
sends four packets (top), four ACKs are returned, allowing cwnd to grow slightly. 
By the second RTT period, the growth is enough to overcome the integer rounding 
and cause an increase of one SMSS to cwnd, allowing one additional packet to be 
sent. The growth of cwnd as a nearly linear function of time is illustrated on the
right, on a linear-linear plot. The second line to the right shows how cwnd grows
when every other packet is acknowledged, simulating the use of delayed ACKs. In
this case, the growth is still about linear, but not as rapid.

The assumption of the algorithm is that packet loss caused by bit errors is very
small (much less than 1%), and therefore the loss of a packet signals congestion
somewhere in the network between the source and destination. If this assumption
is false, which it sometimes is for wireless networks, TCP slows down even when
no congestion is present. In addition, many RTTs may be required for the value
of cwnd to grow large, which is required for efficient use of networks with high
capacity. Fixing these issues with TCP has been a popular area for research, and
we discuss some of the various approaches later.

16.2.3 Selecting between Slow Start and Congestion Avoidance 

In normal operations, a TCP connection is always running either the slow start or 
the congestion avoidance procedure, but never the two simultaneously. We now 
turn to the question: what determines the algorithm TCP uses at any given time?
We already know that slow start is used when a new connection is created or 
when a timeout-based retransmission occurs. We now turn to what controls the 
selection between slow start and congestion avoidance. 

We mentioned ssthresh earlier. This threshold is a limit on the value of cwnd
that determines which algorithm is in operation, slow start or congestion avoidance.
When cwnd < ssthresh, slow start is used, and when cwnd > ssthresh, congestion
avoidance is used. When they are equal, either can be used. The most
important distinction between slow start and congestion avoidance, as we have 
seen, is how each modifies the value of cwnd when new ACKs arrive. What makes 
TCP somewhat tricky and interesting is that the value of ssthresh is not fixed but 
instead varies over time. Its main purpose is to remember the last "best" estimate 
of the operating window when no loss was present. Said another way, it holds the 
lower bound on TCP's best estimate of the optimal window size. 

The initial value of ssthresh may be set arbitrarily high (e.g., to awnd or higher), 
which causes TCP to always start with slow start. When a retransmission occurs, 
caused by either a retransmission timeout or the execution of fast retransmit, 
ssthresh is updated as follows: 

ssthresh = max(flight size/2, 2*SMSS) [1] 


Note 

In Microsoft’s most recent (“Next Generation”) TCP/IP stack, this equation is
reportedly changed to the somewhat more conservative relationship: ssthresh =
max(min(cwnd, awnd)/2, 2*SMSS) [NB08].

Here we see that if a retransmission is required, TCP assumes that the operating
window must have been too large for the network to handle. Reducing
the estimate of the optimal window size is accomplished by altering ssthresh to
be about half of what the current window size is (but not ever below twice the
SMSS). This usually results in lowering ssthresh, but it can also result in increasing
ssthresh. If we examine the congestion avoidance procedure for TCP, we recall that
if an entire window's worth of data is successfully exchanged, the value of cwnd is
allowed to increase by approximately 1 SMSS. Thus, if cwnd has grown large over
a considerable amount of time, setting ssthresh to half of the flight size could cause
it to increase. This happens when TCP has discovered more usable bandwidth.
The interplay between ssthresh and cwnd, in conjunction with the operation of slow
start and congestion avoidance, gives TCP its characteristic behavior in the face of
congestion. We now explore the complete, combined algorithms.
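
Equation [1] and the possibility of ssthresh increasing can be illustrated with a short sketch. The function name and the example numbers (an old ssthresh of 8 segments and a flight size of 40 segments) are hypothetical.

```python
def ssthresh_after_loss(flight_size: int, smss: int) -> int:
    """Equation [1]: half the flight size, but never below 2*SMSS (bytes)."""
    return max(flight_size // 2, 2 * smss)


SMSS = 1460
old_ssthresh = 8 * SMSS   # set by an earlier loss
flight_size = 40 * SMSS   # window has since grown in congestion avoidance

new_ssthresh = ssthresh_after_loss(flight_size, SMSS)
print(new_ssthresh, new_ssthresh > old_ssthresh)  # 29200 True

# With very little data in flight, the 2*SMSS floor applies instead.
print(ssthresh_after_loss(SMSS, SMSS))  # 2920
```

Here the connection had grown well past its old threshold before the loss, so the retransmission raises ssthresh from 8 to 20 segments.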

16.2.4 Tahoe, Reno, and Fast Recovery 

The algorithms discussed so far, slow start and congestion avoidance, constitute
the first congestion control algorithms applied to TCP. They were introduced in
the late 1980s with the 4.2 release of UC Berkeley's version of UNIX, called the
Berkeley Software Distribution, or BSD UNIX. Thus began the convention of naming
various versions of TCP after U.S. cities, especially those where gambling is
permitted.

The 4.2 release of BSD (called Tahoe) included a version of TCP that started
connections in slow start, and if a packet was lost, detected by either a timeout or
the fast retransmit procedure, the slow start algorithm was reinitiated. Tahoe was
implemented by simply reducing cwnd to its starting value (1 SMSS at that time)
upon any loss, forcing the connection to slow start until cwnd grew to the value
ssthresh.

One problem with this approach is that for large BDP paths, this can cause the
connection to significantly underutilize the available bandwidth while the sending
TCP goes through slow start to get back to the point at which it was operating
before the packet loss. To address this problem, the reinitiation of slow start on any
packet loss was reconsidered. Ultimately, if packet loss is detected by duplicate
ACKs (invoking fast retransmit), cwnd is reset to the last value of ssthresh
instead of only 1 SMSS. (Slow start is still initiated on a timeout, which is generally
the case for most TCP variants.) This approach allows the TCP to slow down to
half of its previous rate without reverting to slow start.

In exploring the issue of large BDP paths further and thinking back to the
conservation of packets principle mentioned before, it has been observed that any
ACKs that are received, even while recovering after a loss, still represent opportunities
to inject new packets into the network. This became the basis of the fast
recovery procedure, which was released in conjunction with the popular 4.3 BSD
Reno version of BSD UNIX. Fast recovery allows cwnd to (temporarily) grow by 1
SMSS for each ACK received while recovering. The congestion window is therefore
inflated for a period of time, allowing an additional new packet to be sent for each
ACK received, until a good ACK is seen. Any nonduplicate ("good") ACK causes
TCP to exit recovery and reduces the congestion window back to its pre-inflated
value. TCP Reno became very popular and ultimately the basis for what might
reasonably be called "standard TCP."

16.2.5 Standard TCP 

Although what constitutes "standard" TCP is subject to some debate, the algorithms
we have discussed so far constitute the primary procedures identified
with standard TCP operation. The slow start and congestion avoidance algorithms
are usually implemented together, and the baseline overall behavior is given in
[RFC5681]. This specification does not require the use of these exact algorithms,
but a requirement is imposed that any TCP implementation not be more aggressive
than these algorithms would allow.

To summarize the combined algorithm from [RFC5681], TCP begins a connection
in slow start (cwnd = IW) with a large value of ssthresh, generally at least
the value of awnd. Upon receiving a good ACK (one that acknowledges new data),
TCP updates the value of cwnd as follows:

cwnd += SMSS (if cwnd < ssthresh) Slow start

cwnd += SMSS*SMSS/cwnd (if cwnd > ssthresh) Congestion avoidance

When fast retransmit is invoked because of receipt of a third duplicate ACK
(or other signal, if conventional fast retransmit initiation is not used), the following
actions are performed:

1. ssthresh is updated to no more than the value given in equation [1].

2. The fast retransmit algorithm is performed, and cwnd is set to (ssthresh +
3*SMSS).

3. cwnd is temporarily increased by SMSS for each duplicate ACK received.

4. When a good ACK is received, cwnd is reset back to ssthresh.

The actions in steps 2 and 3 constitute fast recovery. Step 2 first adjusts cwnd,
which usually causes it to be reduced to half of its former value, and then temporarily
inflates it to take into account the fact that the receipt of each duplicate ACK
indicates that some packet has left the network (and thus should permit another to
be inserted). This step is also where multiplicative decrease occurs, as cwnd is ordinarily
multiplied by some value (0.5 here) to form its new value. Step 3 continues
the inflation process, allowing the sender to send additional packets (assuming
awnd is not exceeded). In step 4, the TCP is assumed to have recovered, so the
temporary inflation is removed (and so this step is sometimes called "deflation").
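
The ACK-driven updates and the four fast retransmit/recovery steps can be put together in one small sketch. It is a simplified illustration rather than a faithful implementation of [RFC5681]: the class name RenoSender and its methods are hypothetical, cwnd is kept in integer bytes (with a floor of 1 byte of growth per ACK), congestion avoidance is chosen when cwnd equals ssthresh, and timeouts, retransmission itself, and awnd are left out.

```python
class RenoSender:
    """Minimal sketch of slow start, congestion avoidance, and fast recovery."""

    def __init__(self, smss: int, iw: int, ssthresh: int) -> None:
        self.smss = smss
        self.cwnd = iw
        self.ssthresh = ssthresh
        self.in_recovery = False

    def on_good_ack(self) -> None:
        if self.in_recovery:
            # Step 4: deflate the window and leave fast recovery.
            self.cwnd = self.ssthresh
            self.in_recovery = False
        elif self.cwnd < self.ssthresh:
            self.cwnd += self.smss  # slow start
        else:
            # Congestion avoidance, using integer arithmetic (assumption).
            self.cwnd += max(1, self.smss * self.smss // self.cwnd)

    def on_dup_ack(self, dup_count: int, flight_size: int) -> None:
        if self.in_recovery:
            self.cwnd += self.smss  # step 3: inflate per duplicate ACK
        elif dup_count == 3:
            # Steps 1 and 2: equation [1], then ssthresh + 3*SMSS.
            self.ssthresh = max(flight_size // 2, 2 * self.smss)
            self.cwnd = self.ssthresh + 3 * self.smss
            self.in_recovery = True


s = RenoSender(smss=1000, iw=1000, ssthresh=4000)
for _ in range(4):
    s.on_good_ack()            # 2000, 3000, 4000 (slow start), then 4250
for n in (1, 2, 3, 4):
    s.on_dup_ack(n, flight_size=4000)
print(s.ssthresh, s.cwnd)      # 2000 6000 (inflated during recovery)
s.on_good_ack()
print(s.cwnd, s.in_recovery)   # 2000 False (deflated)
```

The final window is half of the flight size at the time of the loss, which is the multiplicative decrease; the intermediate value of 6000 bytes exists only to keep the ACK clock running during recovery.
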

Slow start is always used in two cases: when a new connection is started, and
when a retransmission timeout occurs. It can also be invoked when a sender has
been idle for a relatively long time or there is some other reason to suspect that
cwnd may not accurately reflect the current network congestion state (see Section
16.3.5). In this case, the initial value of cwnd is set to the restart window (RW). In
[RFC5681], the recommended value of RW = min(IW, cwnd). Other than this case,
when slow start is invoked, cwnd is set to IW.


16.3 Evolution of the Standard Algorithms 

The classic and standard TCP algorithms made a tremendous contribution to the 
operation of TCP, essentially addressing the major problem of Internet congestion 
collapse. 


Note 

The problem of Internet congestion collapse was a serious concern during the
years 1986-1988. In October 1986 the NSFNET backbone, an important component
of the early Internet, had been observed to operate with an effective capacity
some 1000 times less than it should have (called the “NSFNET meltdown”).
The primary reason for the problem was aggressive retransmissions during times
of loss without any controls. This behavior drove the network into a persistently
congested state where packet loss was massive (causing more retransmissions)
and throughput was low. Adoption of the classic congestion control algorithms
effectively eliminated this problem.


However, there remained several areas for improvement. Given TCP's popularity,
a growing amount of effort was put into ensuring that TCP could be made
to work well under a wider range of conditions. We now mention several of these
that are found in many TCP implementations today.


16.3.1 NewReno 

One problem with fast recovery is that when multiple packets are dropped in 
a window of data, once one packet is recovered (i.e., successfully delivered and 
ACKed), a good ACK can be received at the sender that causes the temporary 
window inflation in fast recovery to be erased before all the packets that were lost 
have been retransmitted. ACKs that trigger this behavior are called partial ACKs. 
A Reno TCP reacting to a partial ACK by reducing its inflated congestion window 
can go idle until a retransmission timer fires. To understand why this happens, 
recall that (non-SACK) TCP depends on the signal of three (or dupthresh) duplicate 
ACKs to trigger its fast retransmit procedure. If there are not enough packets in 
the network, it is not possible to trigger this procedure on packet loss, ultimately 
leading to the expiration of the retransmission timer and invocation of the slow
start procedure, which drastically impacts TCP throughput performance.

To address this problem with Reno, a modification called NewReno [RFC3782] 
has been developed. This procedure modifies fast recovery by keeping track of the 
highest sequence number from the last transmitted window of data (the recovery 
point, which we first saw in Chapter 14). Only when an ACK with an ACK number 
at least as large as the recovery point is received is the inflation of fast recovery 
removed. This allows a TCP to continue sending one segment for each ACK 
it receives while recovering and reduces the occurrence of retransmission timeouts, 
especially when multiple packets are dropped in a single window of data. 
NewReno is a popular variant of modern TCPs; it does not suffer from the problems 
of the original fast recovery and is significantly less complicated to implement 
than SACKs. With SACKs, however, a TCP can perform better than NewReno 
when multiple packets are lost in a window of data, but doing this requires careful 
attention to the congestion control procedures, which we discuss next. 
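
The distinction between partial and full ACKs can be made concrete with a short sketch. The following Python fragment is only an illustration of the rule described above, not the procedure as specified in [RFC3782]; the class name, field names, and the choice to deflate cwnd to ssthresh on a full ACK are our own simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass
class NewRenoState:
    """Hypothetical per-connection state, in bytes (names are illustrative)."""
    cwnd: int            # congestion window, inflated while in fast recovery
    ssthresh: int        # slow start threshold chosen when loss was detected
    recovery_point: int  # highest sequence number sent when loss was detected
    in_recovery: bool = True


def on_ack(state: NewRenoState, ack_no: int) -> str:
    """Classify an ACK that advances the left window edge during recovery."""
    if not state.in_recovery:
        return "normal"
    if ack_no >= state.recovery_point:
        # Full ACK: everything outstanding at loss time has been delivered,
        # so the temporary inflation is removed and recovery ends.
        state.cwnd = state.ssthresh
        state.in_recovery = False
        return "full"
    # Partial ACK: another hole remains. The sender retransmits the next
    # missing segment and stays in recovery with the inflation intact.
    return "partial"


state = NewRenoState(cwnd=14000, ssthresh=7000, recovery_point=20000)
print(on_ack(state, 12000), state.cwnd)  # partial ACK, window unchanged
print(on_ack(state, 20000), state.cwnd)  # full ACK, window deflated
```

A Reno sender, by contrast, would leave recovery on the first of these two ACKs, which is what can leave it idle until the retransmission timer fires.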

16.3.2 TCP Congestion Control with SACK 

With the introduction of SACKs and selective repeat to TCP, a sender is able to 
make better decisions about what segments to send in order to fill holes at the 
receiver (see Chapter 14). In filling the receiver's holes, the sender generally sends 
each of the missing segments, in order, until all of the retransmissions for the lost 
segments have been received successfully. This procedure differs from the basic 
fast retransmit/recovery procedure mentioned previously in a somewhat subtle 
way. 

In the case of fast retransmit/recovery, when a packet is lost, the sending TCP 
transmits only the segment it believes is lost and is able to send new data if the 
window W allows. Because the window is inflated for each arriving ACK during 
fast recovery, with larger windows TCP typically is able to send some additional 
data after performing its retransmission. With SACK TCP, the sender can be 
informed of multiple missing segments and would theoretically be able to send 
them all immediately because they would all be in the valid window. However, 
this might involve sending too much data into the network at once, thereby 
compromising the congestion control. The following issue arises with SACK TCP: 
using only cwnd as a bound on the sender's sliding window to indicate how many 
(and which) packets to send during recovery periods is not sufficient. Instead, the 
selection of which packets to send needs to be decoupled from the choice of when 
to send them. Said another way, SACK TCP underscores the need to separate the 
congestion management from the selection and mechanism of packet retransmission. 
Conventional (non-SACK) TCP mixes these together. 

One way to implement this decoupling is to have a TCP keep track of how 
much data it has injected into the network separately from the maintenance of 
the window. In [RFC3517] this is called the pipe variable, an estimate of the flight 
size. Importantly, the pipe variable counts bytes (or packets, depending on the 
implementation) of transmissions and retransmissions, provided they are not 
known to be lost. Assuming a large value of awnd, a SACK TCP is permitted to 
send a segment anytime the following relationship holds true: cwnd - pipe ≥ SMSS. 
In other words, cwnd is still used to place a limit on the amount of data that can 
be outstanding in the network, but the amount of data estimated to be in the network 
is accounted for separately from the window itself. How SACK TCP using 
this approach to congestion control compares with conventional TCP was first 
explored in detail with a series of simulations in [FF96]. 
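
As a rough illustration of this accounting, the sketch below applies the send test that [RFC3517] states as (cwnd - pipe) >= 1 SMSS. The function names are hypothetical, all quantities are assumed to be in bytes, and awnd is assumed large enough not to matter.

```python
def may_send_segment(cwnd: int, pipe: int, smss: int) -> bool:
    """True if at least one full-size segment still fits under cwnd."""
    return cwnd - pipe >= smss


def segments_allowed(cwnd: int, pipe: int, smss: int) -> int:
    """Count how many full-size segments can be sent before pipe reaches cwnd."""
    sent = 0
    while may_send_segment(cwnd, pipe, smss):
        pipe += smss  # every transmission or retransmission adds to pipe
        sent += 1
    return sent


# With cwnd = 10 segments and 7 segments estimated to be in the network,
# the sender may release 3 more, whichever holes it chooses to fill.
print(segments_allowed(cwnd=14000, pipe=9800, smss=1400))
```

Note that the test says nothing about which segments are sent; that choice is made separately, using the SACK information.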

16.3.3 Forward Acknowledgment (FACK) and Rate Halving 

For TCP variants based on Reno (including NewReno), the typical behavior is that 
when cwnd is reduced after a fast retransmit, ACKs for at least one-half of the 
current window's outstanding data must be received before the sending TCP is 
allowed to continue transmitting. This is an expected consequence of reducing 
the congestion window by half immediately when a loss is detected. It causes the 
sending TCP to wait for about half of an RTT and then send any new data during 
the second half of the same RTT, a more bursty behavior than is really required. 

In an effort to avoid the initial pause after loss but not violate the convention 
of emerging from recovery with a congestion window set to half of its size on 
entry, forward acknowledgment (FACK) was described in [MM96]. It consists of two 
algorithms called "overdamping" and "rampdown." Since the initial proposal, 
the authors updated their approach to form a unified and improved algorithm 
they call rate halving, based on earlier work by Hoe [H96]. To ensure that it works 
as effectively as possible, they further govern its behavior by adding bounding 
parameters, resulting in the complete algorithm being called Rate-Halving with 
Bounding Parameters (RHBP) [PSCRH]. 

The basic operation of RHBP allows the TCP sender to send one packet for 
every two duplicate ACKs it receives during one RTT. This causes the recovering 
TCP to have sent the appropriate amount of data by the end of the recovery period, 
but it spaces or paces this data evenly, rather than bunching all the transmissions 
into the second half of the RTT period. Avoiding the bunching or burstiness is 
advantageous because bursts tend to persist across multiple RTTs, stressing router 
buffers more than required. 

To keep an accurate estimate of the flight size, RHBP uses information from 
SACKs to determine the FACK: the highest sequence number known to have 
reached the receiver, plus 1. Taking the difference between the highest sequence 
number about to be sent by the sender (SND.NXT in Figure 15-9) and the FACK 
gives an estimate of the flight size, not including retransmissions. 

With RHBP, a distinction is made between the adjustment interval (the period 
when cwnd is modified) and the repair interval (when some segments are retransmitted). 
The adjustment interval is entered immediately upon a loss or congestion 
indicator. The final value for cwnd when the interval completes is half of 
the correctly delivered portion of the window of data in the network at the time 
of detection. The following expression allows the RHBP sender to transmit, if 
satisfied: 


(SND.NXT - fack + retran_data + len) < cwnd 

This expression captures the flight size, including retransmissions, and 
ensures that if injecting another packet of length len, cwnd will not be exceeded. 
Provided all the data prior to the FACK is indeed no longer in the network (i.e., 
is lost or stored at the receiver), this causes the SACK sender to be appropriately 
controlled by cwnd. However, it can be overly aggressive if packets have been 
reordered in the network because the holes indicated by SACK have not been lost. 
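
To see how the terms of the send test interact, consider the following small sketch. It simply evaluates the expression given above; the function and argument names are ours, and all quantities are assumed to be byte counts.

```python
def fack_may_send(snd_nxt: int, fack: int, retran_data: int,
                  length: int, cwnd: int) -> bool:
    """Evaluate (SND.NXT - fack + retran_data + len) < cwnd."""
    # Data beyond the forward-most SACKed byte, assumed still in the network.
    forward_flight = snd_nxt - fack
    # Retransmissions are in flight too, so they are counted separately.
    return forward_flight + retran_data + length < cwnd


# 8000 bytes beyond the FACK plus 2800 retransmitted bytes are in flight;
# one more 1400-byte segment fits under a 14000-byte cwnd.
print(fack_may_send(snd_nxt=50000, fack=42000, retran_data=2800,
                    length=1400, cwnd=14000))
```

The reordering caveat is visible here: every hole below the FACK is treated as having left the network, so if those segments are merely delayed, the left-hand side underestimates the true flight size.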

In Linux, FACK and rate halving are implemented and enabled by default. 
FACK is activated only when SACK is enabled and the Boolean configuration 
variable net.ipv4.tcp_fack is set to 1. When reordering is detected in the network, 
the more aggressive behavior of FACK is disabled. 

Rate halving is one of several ways of pacing TCP's sending procedure to 
avoid or limit burstiness. Although it offers a number of benefits, it also has a few 
problems. In [ASA00], the authors analyze TCP pacing in some detail using 
simulations, concluding that in many cases it offers inferior performance to TCP Reno. 
Furthermore, rate-halving TCP has been known to exhibit poor performance when 
the connection may become limited by the receiver's advertised window [MM05]. 

16.3.4 Limited Transmit 

In [RFC3042], the authors propose limited transmit, a small modification to TCP 
designed to help it perform better when the usable window is small. Recall from 
the experience with Reno TCP that when operating with a small window, there 
may not be enough packets in the network to trigger the fast retransmit/recovery 
algorithms when loss occurs, as these algorithms typically require three duplicate 
ACKs to be observed prior to initiation. 

With limited transmit, a TCP with unsent data is permitted to send a new 
packet for each pair of consecutive duplicate ACKs it receives. Doing this helps 
to keep at least a minimal number of packets in the network, enough so that 
fast retransmit can be triggered upon packet loss. This is advantageous to TCP 
because waiting for an RTO (which can be a relatively large amount of time, 
several hundred milliseconds) can degrade throughput performance considerably. 
As of [RFC5681], limited transmit is now a recommended TCP behavior. Note that 
rate halving is one form of limited transmit. 

16.3.5 Congestion Window Validation (CWV) 

One of the issues with congestion management in TCP arises when the TCP 
sender stops sending for a period of time, either because it has no more data to 
send, or because it has been prevented from sending when it wants to for some 
other reason. If all goes well, a sender never pauses, and it continues sending data 
and receiving ACKs from its peer. This continuous feedback enables it to keep a 
reasonably current (within one RTT) estimate of what cwnd and ssthresh should be. 

If the TCP sender has been sending for some time, its cwnd may have grown 
to a substantial size. If it then fails to send for some time but resumes later, the 
large cwnd may allow the sender to inject an undesirably large number of packets 
(i.e., a high-rate burst) into the network without delay. Furthermore, if the pause 
is sufficiently long, its last cwnd value may no longer be appropriate for the path 
and congestion state. 

In [RFC2861], the authors propose an experimental Congestion Window Validation 
(CWV) mechanism. Essentially, the sender's current value of cwnd decays over 
a period of nonuse, and ssthresh maintains the "memory" of it prior to the initiation 
of the decay. To understand the scheme, a distinction is made between an idle 
sender and an application-limited sender. The idle sender has stopped producing 
data it wants to send into the network; ACKs for all the data it has sent so far 
have been received. Thus, the connection is truly quiescent: no data is flowing, 
so no ACKs are either, except for occasional window updates (see Chapter 15). The 
application-limited sender does have more data to send but has been unable to 
for some reason. This could be because the sending computer is busy doing other 
tasks, or because some mechanism or protocol layer below TCP is preventing data 
from being sent. This case results in underutilization of the allowed congestion 
window, but the connection is not completely quiescent. In particular, ACKs may 
still be returning for previously sent data. 

The CWV algorithm works as follows: Whenever a new packet is to be sent, the 
time since the last send event is measured to determine if it exceeds one RTO. If so, 

• ssthresh is modified but not reduced; it is set to max(ssthresh, (3/4)*cwnd). 

• cwnd is reduced by half for each RTT of idle time but is always at least 1 
SMSS. 

For application-limited periods that are not idle, the following similar behavior 
is used: 

• The amount of window actually used is stored in W_used. 

• ssthresh is modified but not reduced; it is set to max(ssthresh, (3/4)*cwnd). 

• cwnd is set to the average of cwnd and W_used. 

Both of these changes decay the value of cwnd while "remembering" it in 
ssthresh. The first case can dramatically affect cwnd in one operation, if the application 
has been idle for a long time. Handling the congestion window in this way 
can lead to better performance under some circumstances. As the authors report, 
reducing the burst of packets that can arise after an idle period eases the pressure 
on potentially limited buffer space in routers, ultimately leading to fewer dropped 
packets. Note that because cwnd is decayed and ssthresh is not, the typical consequence 
of applying this algorithm is to place the sender into slow start after a long 
enough pause. CWV is enabled by default in Linux TCP implementations. 
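
The two CWV adjustments can be sketched as follows. This is a simplified reading of the rules listed above rather than the pseudocode of [RFC2861]: windows are in bytes, times are in milliseconds, integer arithmetic is used throughout, and the function names are hypothetical.

```python
def cwv_after_idle(cwnd: int, ssthresh: int, idle_ms: int,
                   rtt_ms: int, smss: int) -> tuple[int, int]:
    """Decay cwnd after an idle period; returns (cwnd, ssthresh)."""
    # ssthresh "remembers" the old window; it is never lowered here.
    ssthresh = max(ssthresh, (3 * cwnd) // 4)
    # Halve cwnd once per RTT of idle time, but never below one SMSS.
    for _ in range(idle_ms // rtt_ms):
        if cwnd <= smss:
            break
        cwnd = max(cwnd // 2, smss)
    return cwnd, ssthresh


def cwv_app_limited(cwnd: int, ssthresh: int, w_used: int) -> tuple[int, int]:
    """Adjust cwnd after an application-limited period; returns (cwnd, ssthresh)."""
    ssthresh = max(ssthresh, (3 * cwnd) // 4)
    cwnd = (cwnd + w_used) // 2  # average of cwnd and the window actually used
    return cwnd, ssthresh


# One second of idle time with a 250ms RTT halves a 32-segment window 4 times.
print(cwv_after_idle(cwnd=44800, ssthresh=20000, idle_ms=1000,
                     rtt_ms=250, smss=1400))
# A sender that used only 8 of its 32 segments ends up with 20.
print(cwv_app_limited(cwnd=44800, ssthresh=20000, w_used=11200))
```

In both cases the returned cwnd is below the returned ssthresh, which is why a sender resuming after a pause finds itself in slow start.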


16.4 Handling Spurious RTOs—the Eifel Response Algorithm 

As we saw in Chapter 15, when TCP encounters a large delay spike, it can experience 
a retransmission timeout even if no packet has been lost. Such spurious 
retransmissions arise in a number of circumstances relating to changes in the 
underlying link layer (such as cellular handoff) or sudden onset of severe congestion 
contributing to a large increase in RTT. When this happens, the TCP adjusts 
ssthresh and enters slow start by setting cwnd to IW. If no packets have been lost, 
ACKs arriving after the RTO cause cwnd to grow relatively quickly, but TCP still 
sends unnecessary retransmissions and underutilizes the capacity until cwnd and 
ssthresh resettle. 

To avoid the performance problems associated with spurious retransmissions, 
several methods have been proposed to detect them. We discussed some of them 
(e.g., DSACK, Eifel, F-RTO) in Chapter 14. Any one of these, or possibly others 
that may be developed, can be coupled with a response algorithm used to "undo" 
the changes TCP makes to its congestion control variables after a timeout. One 
popular (i.e., in the IETF standards track) response algorithm is the Eifel Response 
Algorithm [RFC4015]. 

Eifel comprises both a detection and a response algorithm, which are logically 
disjoint. Any TCP implementation using the Eifel Response Algorithm is compelled 
to use some detection algorithm specified in a standards-track or experimental 
RFC (i.e., one that is documented). 

The Eifel Response Algorithm is aimed at handling the retransmission timer 
and congestion control state after a retransmission timer has expired. Here we 
discuss only the congestion-related portions of the response algorithm. It is initiated 
after the first timeout-based retransmission is sent. Its purpose is to undo a 
change to ssthresh when a retransmission is deemed to be spurious. In all cases, 
before ssthresh is modified as a result of the RTO, it is captured in a special variable 
as follows: pipe_prev = min(flight size, ssthresh). Once this has been accomplished, 
a detection algorithm, such as one of those mentioned previously, is invoked in 
order to determine if the RTO is spurious. If it is, the following steps are executed 
when an ACK arrives after the retransmission: 

1. If a received good ACK includes an ECN-Echo flag, stop (see Section 16.11). 

2. cwnd = flight size + min(bytes_acked, IW) (assuming cwnd is measured in 
bytes). 

3. ssthresh = pipe_prev. 

The pipe_prev variable is set before ssthresh is changed in the ordinary way. It 
provides a memory for ssthresh, so that it can be reinstantiated in step 3 if necessary. 
Step 1 deals with the case of an arriving ACK carrying the ECN flag. (We discuss 
ECN more in Section 16.11.) When this happens, it is considered unsafe to 
undo the reduction of ssthresh, so the algorithm terminates. Steps 2 and 3 constitute 
the important part of the algorithm (with respect to cwnd). Step 2 restores cwnd 
to a point where it may be able to inject some additional traffic into the network, but 
not more than IW new data. IW is considered a safe amount of data to inject into a 
network path with unknown congestion state. Step 3 restores ssthresh to its value 
before the RTO occurred, completing the undo operation. 
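
The three steps, together with the capture of pipe_prev, can be written down compactly. The sketch below covers only the congestion-related portion discussed here, under the assumption that all quantities are in bytes; the function names and argument list are ours and do not come from [RFC4015].

```python
def capture_pipe_prev(flight_size: int, ssthresh: int) -> int:
    """Run when the RTO fires, before ssthresh is changed in the ordinary way."""
    return min(flight_size, ssthresh)


def eifel_response(cwnd: int, ssthresh: int, pipe_prev: int, flight_size: int,
                   bytes_acked: int, iw: int, ecn_echo: bool) -> tuple[int, int]:
    """Run on an ACK after the RTO is judged spurious; returns (cwnd, ssthresh)."""
    if ecn_echo:
        # Step 1: congestion was signaled, so the reduction is left in place.
        return cwnd, ssthresh
    # Step 2: allow the current flight plus at most IW bytes of new data.
    cwnd = flight_size + min(bytes_acked, iw)
    # Step 3: restore the ssthresh remembered at the time of the RTO.
    ssthresh = pipe_prev
    return cwnd, ssthresh


pipe_prev = capture_pipe_prev(flight_size=28000, ssthresh=64000)
# After the timeout the sender is in slow start with a small cwnd and a
# reduced ssthresh; a spurious-RTO verdict lets both be restored.
print(eifel_response(cwnd=1400, ssthresh=14000, pipe_prev=pipe_prev,
                     flight_size=26600, bytes_acked=1400, iw=4200,
                     ecn_echo=False))
```

The detection algorithm itself is deliberately absent from the sketch; any of the documented detection methods could supply the spurious-RTO verdict.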


16.5 An Extended Example 

We now turn to an extended example to demonstrate most of the behaviors 
described in the preceding sections. Using the sock program, we arrange to send 
about 2.5MB of data from a Linux (2.6) sender to a FreeBSD (5.4) receiver over a 
DSL line. The DSL line is rate-limited in this direction to approximately 300Kb/s. 
The FreeBSD receiver is attached to a high-bandwidth connection. The minimum 
RTT between sender and receiver is 15.9ms, and there are 17 hops in the path. The 
systems are configured to use the baseline algorithms (i.e., slow start and congestion 
avoidance) for most of their processing. This avoids many of the 
operating-system-specific details. (We cover some of these later.) To set up this experiment, 
we run the following command at the receiver: 


FreeBSD% sock -i -r 32768 -R 233016 -s 6666 


This command arranges for the sock program to use a fairly large socket receive 
buffer (228KB) and perform fairly large application reads (32KB). For the path 
used, this is an adequate size of buffer for the receiver. At the sender we run the 
sock program in sending mode, as follows: 

Linux% sock -n20 -i -w 131072 -S 262144 128.32.37.219 6666 

This selects a large send buffer and sends 20*131,072 bytes (2.5MB) of data. The 
packet trace is captured using tcpdump on the sender. The command used to 
capture this trace is as follows: 

Linux# tcpdump -s 128 -w sack-to-free-12.td port 6666 

This ensures that at least 128 bytes of each packet are captured, plenty to capture 
all interesting TCP and IP header information. After the trace is collected, we can 
use the tcptrace tool [TCPTRACE] to get a number of useful summary statistics 
regarding the connection: 


Linux% tcptrace -Wl sack-to-free-12.td 


This command requests the program to provide information on the congestion 
window and output using a long (verbose) format. It produces the following 
output: 

1 arg remaining, starting with 'sack-to-free-12.td' 
Ostermann's tcptrace -- version 6.6.7 -- Thu Nov 4, 2004 

3175 packets seen, 3175 TCP packets traced 
elapsed wallclock time: 0:00:00.167213, 18987 pkts/sec analyzed 
trace file elapsed time: 0:01:40.475872 
TCP connection info: 
1 TCP connection traced: 
TCP connection 1: 
    host a:         adsl-63-203-72-138.dsl.snfc21.pacbell.net:1059 
    host b:         dwight.CS.Berkeley.EDU:6666 
    complete conn:  yes 
    first packet:   Wed Sep 28 22:15:29.956897 2005 
    last packet:    Wed Sep 28 22:17:10.432769 2005 
    elapsed time:   0:01:40.475872 
    total packets:  3175 
    filename:       sack-to-free-12.td 

  a->b:                                 b->a: 
    total packets:      1903              total packets:      1272 
    ack pkts sent:      1902              ack pkts sent:      1272 
    pure acks sent:     2                 pure acks sent:     1270 
    sack pkts sent:     0                 sack pkts sent:     79 
    dsack pkts sent:    0                 dsack pkts sent:    0 
    max sack blks/ack:  0                 max sack blks/ack:  2 
    unique bytes sent:  2621440           unique bytes sent:  0 
    actual data pkts:   1900              actual data pkts:   0 
    actual data bytes:  2659240           actual data bytes:  0 
    rexmt data pkts:    27                rexmt data pkts:    0 
    rexmt data bytes:   37800             rexmt data bytes:   0 
    zwnd probe pkts:    0                 zwnd probe pkts:    0 
    zwnd probe bytes:   0                 zwnd probe bytes:   0 
    outoforder pkts:    0                 outoforder pkts:    0 
    pushed data pkts:   44                pushed data pkts:   0 
    SYN/FIN pkts sent:  1/1               SYN/FIN pkts sent:  1/1 
    req 1323 ws/ts:     Y/Y               req 1323 ws/ts:     Y/Y 
    adv wind scale:     2                 adv wind scale:     2 
    req sack:           Y                 req sack:           Y 
    sacks sent:         0                 sacks sent:         79 
    urgent data pkts:   0 pkts            urgent data pkts:   0 pkts 
    urgent data bytes:  0 bytes           urgent data bytes:  0 bytes 
    mss requested:      1412 bytes        mss requested:      1460 bytes 
    max segm size:      1400 bytes        max segm size:      0 bytes 
    min segm size:      640 bytes         min segm size:      0 bytes 
    avg segm size:      1399 bytes        avg segm size:      0 bytes 
    max win adv:        5808 bytes        max win adv:        233016 bytes 
    min win adv:        5808 bytes        min win adv:        170016 bytes 
    zero win adv:       0 times           zero win adv:       0 times 
    avg win adv:        5808 bytes        avg win adv:        232268 bytes 
    max owin:           137201 bytes      max owin:           1 bytes 
    min non-zero owin:  1 bytes           min non-zero owin:  1 bytes 
    avg owin:           37594 bytes       avg owin:           1 bytes 
    wavg owin:          33285 bytes       wavg owin:          0 bytes 
    initial window:     2800 bytes        initial window:     0 bytes 
    initial window:     2 pkts            initial window:     0 pkts 
    ttl stream length:  2621440 bytes     ttl stream length:  0 bytes 
    missed data:        0 bytes           missed data:        0 bytes 
    truncated data:     2556640 bytes     truncated data:     0 bytes 
    truncated packets:  1900 pkts         truncated packets:  0 pkts 
    data xmit time:     99.631 secs       data xmit time:     0.000 secs 
    idletime max:       7778.8 ms         idletime max:       7930.4 ms 
    throughput:         26090 Bps         throughput:         0 Bps 

From this useful tool we can learn quite a bit about the connection. We are 
primarily interested in the left portion of the output (a->b). First of all, we see that 
1903 packets were sent in the a->b direction and 1902 of them were ACKs. This is 
expected, as the very first packet is normally a SYN—the only packet without the 
ACK flag turned on. Pure ACKs refer to packets containing no data. The sender 
produces one of these early in the connection, when providing an ACK to its peer's 
SYN + ACK and when producing the final ACK when the connection is closed, 
so this is also expected. In the second column (b->a direction), we find that the 
receiver sent 1272 packets, all of which are ACKs. Of these, 1270 were pure ACKs, 
and 79 SACK packets (i.e., ACKs containing the SACK option) were sent. The two 
"non-pure" ACKs are the SYN + ACK and the FIN + ACK sent at the beginning 
and end of the connection, respectively. 

The next five values indicate the proportion of data that was retransmitted. As 
we can see, 2,621,440 unique bytes were sent (i.e., not retransmitted), but 2,659,240 
bytes were sent in total, meaning some 2,659,240 - 2,621,440 = 37,800 bytes must 
have been sent more than once. The next two fields confirm this and indicate 
that these retransmitted bytes were contained in 27 retransmitted packets, for an 
average retransmitted segment size of 37,800/27 = 1400 bytes. Because this connection 
transferred 2,659,240 bytes in 100.476s, its average throughput is 26,466 bytes/s (about 
212Kb/s). Its average goodput, the amount of unretransmitted data transferred 
per unit time, is 2,621,440/100.476 = 26,090 B/s, about 209Kb/s. As we shall see, this 
connection experiences a number of significant disruptions to its normal operation. 
We shall use Wireshark's analysis capabilities and our own analysis to follow 
TCP's behavior when such events occur. 
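
The arithmetic behind these figures is easy to reproduce from the tcptrace fields. The following sketch recomputes them; the function name and dictionary keys are ours, and the elapsed time is taken from the trace file elapsed time of 0:01:40.475872.

```python
def summarize(unique_bytes: int, total_bytes: int,
              rexmt_pkts: int, elapsed_s: float) -> dict:
    """Derive retransmission and rate figures from tcptrace summary fields."""
    rexmt_bytes = total_bytes - unique_bytes
    return {
        "rexmt_bytes": rexmt_bytes,
        "avg_rexmt_segment": rexmt_bytes / rexmt_pkts,
        "throughput_Bps": total_bytes / elapsed_s,   # counts every byte sent
        "goodput_Bps": unique_bytes / elapsed_s,     # counts each byte once
    }


stats = summarize(unique_bytes=2_621_440, total_bytes=2_659_240,
                  rexmt_pkts=27, elapsed_s=100.475872)
for name, value in stats.items():
    print(f"{name}: {value:.0f}")
# Rates in kilobits per second (1Kb/s taken as 1000 bits/s).
print(round(stats["throughput_Bps"] * 8 / 1000),
      round(stats["goodput_Bps"] * 8 / 1000))
```

Note that the value tcptrace labels as throughput (26090 Bps) corresponds to what we call goodput here, because it is computed from the unique bytes sent.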

To get a visual image of the trace, we can use the Statistics | TCP Stream 
Graph | Time-Sequence Graph (tcptrace) function in Wireshark's Statistics 
menu to obtain the image shown in Figure 16-4 (enhanced with arrows for the 
discussion that follows). 

The y-axis of Figure 16-4 represents the relative TCP sequence number. Each 
small tick mark represents 100,000 sequence numbers. The x-axis is time, in seconds. 
The dark solid line comprises many smaller I-shaped line segments, each 
of which represents the range of sequence numbers contained in a TCP segment. 



Figure 16-4 Wireshark trace of a 2.5MB file upload executed by a Linux 2.6.10 TCP sender over a 
DSL line rate-limited to approximately 300Kb/s. The dark line represents sent sequence 
numbers. The top line is the highest sequence number advertised by the receiver (its 
right window edge), and the lower line represents the highest segment acknowledged 
by the receiver so far seen at the sender. The 11 events labeled represent cases where the 
congestion window has been modified. 


The height of the I indicates the user-data payload size, in bytes. The slope of the 
"line" formed by these I-shaped characters is the data rate achieved by the connection. 
Any movement to the lower right indicates a retransmission. The slope of 
the line for any given time range provides the average throughput over that time. 
As we can see, the highest sequence number sent was about 2600000 at time 100, 
which provides for a rough average goodput rate of 26,000 bytes/s, quite close to 
the numeric value from the preceding tcptrace output. 

The top line is the largest sequence number the receiver is willing to accept 
(its highest advertised window) so far. As we can see, at the beginning of the time 
series, this line is at about 250000, with the actual value being 233016, as indicated 
in the tcptrace output, in the b->a column. The bottom line represents the 
highest ACK number received at the sender so far. As discussed previously, TCP 
searches for additional bandwidth while it operates, by increasing its congestion 
window. It does not violate the receiver's advertised window. We see this in operation 
in this graph as the solid line moves from the lower line toward the upper 
line over time. If the upper line is never reached, either the sender or the usable 
network capacity is the limiting factor for the throughput of the connection. If the 
upper line is always reached, the receiver's window is the likely limiting factor. 

16.5.1 Slow Start Behavior 

We begin our analysis by observing the operation of the slow start algorithm 
described earlier. In Wireshark, we select the first packet of the trace and then 
use its Statistics | Flow graph function to illustrate the packets exchanged at the 
beginning of the connection (see Figure 16-5). 



Figure 16-5 The Wireshark analysis shows the sequence and ACK numbers exchanged when the 
connection is first established. Each ACK received at the sender liberates two or three 
packets. This characteristic is typical of a sender in slow start. 


Here we see the initial SYN and SYN + ACK exchange. The ACK at time 0.032 
is a window update (see Chapter 15). The first two data packets are sent at times 
0.126 and 0.127. The ACK at time 0.210 is not for a single packet. Its ACK number 
is 2801 and thus ACKs both of the previously sent data packets because of the 
cumulative property of TCP ACKs. This is an example of delayed ACKs, which are 
often generated for every other data packet (or more frequently, as recommended 
by [RFC5681]). As we shall see for this particular (FreeBSD 5.4) receiver, it alternates 
between ACKing one packet and two packets. This means there are two 
ACKs returned for every three data packets sent on average (assuming no errors or 
retransmissions). We discussed delayed ACKs and window updates in Chapter 15. 

An ACK arriving that covers two packets allows the sliding window at the 
sender to move forward by two packets and therefore permits two additional 
packets to be sent into the network. However, because this connection is just starting 
out and it is still executing slow start, the arrival of a good ACK causes the 
sender to increase its congestion window by one packet (this Linux TCP manages 
its congestion window in packet units). In this case, the cwnd grows from 2 to 3. 
This has the effect of allowing three packets to be sent overall as a result of the arriving 
ACK. They are sent at times 0.215, 0.216, and 0.217. 

The ACK arriving at time 0.264 ACKs a single packet and indicates that the 
receiver next expects to see sequence number 4201. That packet, however, and the 
one after it with sequence number 5601, have already been sent and are still outstanding. 
Thus, the ACK arrival allows cwnd to grow from 3 to 4, but because 
two packets are already outstanding, only two more are allowed to be sent (one 
because the ACK slid the window forward, another because the received good 
ACK allowed cwnd to grow by one packet). They are sent at times 0.268 and 0.268 
(within the same 1/1000s). 
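
The bookkeeping in the last two paragraphs can be checked with a few lines of code. The sketch below assumes, as in this trace, that cwnd is kept in packet units, that the sender always has data ready, and that the advertised window is not a limit; the function name is hypothetical.

```python
def slow_start_ack(cwnd: int, outstanding: int,
                   pkts_acked: int) -> tuple[int, int, int]:
    """Process one good ACK in slow start.

    Returns (new cwnd, new outstanding count, packets released by this ACK).
    """
    outstanding -= pkts_acked      # the cumulative ACK slides the window forward
    cwnd += 1                      # slow start adds one packet per good ACK
    released = cwnd - outstanding  # room now available in the window
    return cwnd, outstanding + released, released


cwnd, outstanding = 2, 2           # two data packets sent initially
for acked in (2, 1, 2):            # receiver alternates ACKing two and one
    cwnd, outstanding, released = slow_start_ack(cwnd, outstanding, acked)
    print(f"ACK for {acked} packet(s): cwnd={cwnd}, released={released}")
```

The first two iterations reproduce the bursts of three and two packets described above; each later ACK likewise liberates two or three packets.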

This startup behavior is typical of a sender executing slow start with a receiver 
delaying ACKs. The process continues in this fashion (each ACK liberating two 
or three packets) until something interesting occurs at about time 5.6. We now 
explore this further. 

16.5.2 Sender Pause and Local Congestion (Event 1) 

Looking at Figure 16-4, we find that after a segment is sent at time 5.512, a pause 
occurs until the next data segment is sent at time 6.162. This can be better seen by 
using Wireshark's graphical zoom-in feature as shown in Figure 16-6. 

In this figure we see that the sender has stopped sending, no retransmitted 
packet appears to be present, yet the data rate appears to decrease after the pause. 
Why is this? We can investigate further with the flow trace function once again 
(see Figure 16-7). 

The sending TCP has evidently ceased its sending demand at time 5.559. This 
is supported by the fact that the last transmitted data segment before the pause 
has the PSH flag turned on, which typically indicates that the sending buffer has 
been emptied. There could be several reasons for this, including the possibility 
that the host system is busy doing something else, preventing the sending application 
from initiating its next write of data into the network. 

We can observe that this pause is not the beginning of a retransmission recovery 
period, yet the slope of the line decreases after the pause, indicating a reduced 
sending rate. Let us explore this behavior more closely to figure out why. 

The last sequence number sent before the pause is 343001 + 1400 - 1 = 344400, 
which has never been sent before and is therefore not a retransmission. After the 
segment is sent at time 5.486 (highlighted), this connection will have its greatest 
amount of outstanding data: 341,601 + 1400 - 205,801 = 137,200 bytes (98 packets). 

Figure 16-6 After starting using the slow start procedure, the connection pauses for about 512ms, at 
time 5.512, and then continues by sending a burst. 


This tells us that the value of cwnd is 98 packets. The arrival of the ACK at time 
5.556 indicates that two more packets have been received at the receiver. The last 
packet to be sent before the pause contains sequence number 344400, so 97 packets 
are outstanding. 

While the application is paused, 11 ACKs arrive (each alternating between 
ACKing either one or two full-size segments as mentioned before). The last one 
indicates that sequence number 233800 has been received, meaning 110,600 bytes 
(79 packets) now remain outstanding. At this point, the sender wakes up and continues 
to transmit. As a result of receiving this ACK at time 6.204, it should be able 
to inject 98 - 79 = 19 more packets at this point but is able to send only 8. The last 
sequence number it is able to send is 354201 + 1400 - 1 = 355600 at time 6.128. 
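
The window arithmetic used in this analysis is summarized in the sketch below. The sequence numbers are those quoted in the text; the function and variable names are ours, and every segment is assumed to be 1400 bytes.

```python
SMSS = 1400  # bytes per full-size segment in this trace


def outstanding_bytes(snd_nxt: int, snd_una: int) -> int:
    """Bytes sent but not yet ACKed: next sequence number minus highest ACK."""
    return snd_nxt - snd_una


# Peak flight size, just after the segment starting at 341601 is sent.
peak = outstanding_bytes(341601 + SMSS, 205801)
cwnd_pkts = peak // SMSS

# After the pause: bytes through 344400 sent, bytes through 233800 ACKed.
after_pause = outstanding_bytes(344400 + 1, 233800 + 1)
outstanding_pkts = after_pause // SMSS

expected_burst = cwnd_pkts - outstanding_pkts
observed_burst = 8  # packets actually seen in the trace
print(peak, cwnd_pkts, after_pause, outstanding_pkts,
      expected_burst, expected_burst - observed_burst)
```

The final value printed is the number of packets that cwnd would have permitted but that do not appear in the burst.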

Whaf happens fo fhe TCP af fhis poinf is nof immediafely obvious from fhe 
frace. We would have expecfed 19 packefs fo be senf, buf only 8 were. The reason 
is fhaf fhe sender filled a local (lower-layer) queue wifh ifs bursf of packefs and fhe 
subsequenf ones were unable fo be senf. Using fhe following command in Linux, 
and knowing fhaf our fransfer fakes place over fhe pppO nefwork inferface, we 
can fry fo defermine if some lower layer has caused TCP fo have problems: 
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Comment 




Figure 16-7 The sender pauses at time 5.559. In addition, the burst of packets at time 6.209 is limited 
to eight because of local congestion. Some TCP implementations such as this one limit 
the sending rate to avoid congesting queues on the sending host. 


Linux% tc -s -d qdisc show dev ppp0

qdisc pfifo_fast 0: bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 122569547 bytes 348574 pkts (dropped 2, overlimits 0 requeues 0)

The tc program is used to administer the packet scheduling and traffic control 
subsystem in Linux [LARTC]. The -s and -d options provide detailed statistics. 
The directive qdisc show dev ppp0 means the queuing discipline for device
ppp0 should be displayed, which is the method used to hold and prioritize the
order in which packets are sent. Notice the two dropped packets. These packets 
were not dropped in the network but rather in the sending computer in a protocol 
layer below TCP. Furthermore, because they were dropped in a layer below TCP 
but above the layer where the packet capture facility operates, these packet trans¬ 
mission attempts are not visible in the trace. Dropping transmitted TCP packets at 
the sending system is sometimes called local congestion, and it arises because TCP 
is producing data faster than the underlying local queues can be emptied. 
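When checking for local congestion repeatedly, it can be convenient to pull the drop counter out of the tc output programmatically. The following Python sketch parses the statistics line shown above; the regular expression is an assumption based on that one sample, and the exact layout may differ between versions of the tool.

```python
import re

# Pattern for the statistics line printed by "tc -s qdisc show"; the layout
# is assumed from the sample output above and is not guaranteed to be stable.
_STATS = re.compile(
    r"Sent (?P<bytes>\d+) bytes (?P<pkts>\d+) pkts? "
    r"\(dropped (?P<dropped>\d+), overlimits (?P<overlimits>\d+) "
    r"requeues (?P<requeues>\d+)\)"
)


def parse_qdisc_stats(line):
    """Return the qdisc counters found in one line of tc output as integers."""
    match = _STATS.search(line)
    if match is None:
        raise ValueError("no qdisc statistics found in line")
    return {name: int(value) for name, value in match.groupdict().items()}


sample = " Sent 122569547 bytes 348574 pkts (dropped 2, overlimits 0 requeues 0)"
stats = parse_qdisc_stats(sample)
print(stats["dropped"])
```

A nonzero dropped count that grows during a transfer suggests that packets are being discarded on the sending host rather than in the network.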
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Note 

The Linux traffic control subsystem and other priority or QoS features supported
in routers and operating systems (e.g., Microsoft’s qWave API [WQOS]) support 
different queuing disciplines that may order packets differently based on features 
in the packets (e.g., the IP DSCP value or TCP port number). Placing priority 
on some packets (e.g., multimedia data packets, TCP pure ACKs) may improve 
the user experience for interactive applications in networks that support priority. 
Much of the Internet does not support such priorities, but many LANs and some 
enterprise IP networks do. 


Local congestion is one of several reasons the Linux TCP implementation may
be placed in the Congestion Window Reducing (CWR) state [SK02]. On entry, TCP
sets ssthresh to cwnd/2 and sets cwnd to min(cwnd, flight size + 1). In the
CWR state, the sender reduces cwnd by one packet for every two ACKs received
until cwnd reaches the new ssthresh or the CWR state is exited for some other
reason such as a loss event. It is essentially the rate-halving algorithm we mentioned
previously. It is also invoked when the sending TCP receives an ECN-Echo
indication in the received TCP header (see Section 16.1.1).

With this knowledge, we can now understand what happened. When TCP 
continues after the pause, it is able to send only 8 packets. Any additional packets 
cannot be sent because of local congestion and instead place the TCP into the CWR 
state. Immediately, ssthresh is reduced to 98/2 = 49 packets and cwnd is set to 79 + 8 
= 87 packets. It then remains in the CWR state where it reduces cwnd by 1 for every 
two ACKs it receives, leading to a reduction in sending rate, until cwnd reaches 66 
packets at time 8.364. 
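The CWR rules described above can be sketched in Python as follows. This is a simplified model rather than the Linux code: the function names are hypothetical, windows are counted in whole packets, and the flight-size inputs in the usage lines are illustrative values, not readings from the trace.

```python
def enter_cwr(cwnd, flight_size):
    """Apply the CWR entry rule: halve ssthresh, clamp cwnd to flight size + 1."""
    ssthresh = cwnd // 2
    cwnd = min(cwnd, flight_size + 1)
    return cwnd, ssthresh


def cwr_rate_halving(cwnd, ssthresh, num_acks):
    """Reduce cwnd by one packet for every second ACK, stopping at ssthresh."""
    for ack_index in range(1, num_acks + 1):
        if ack_index % 2 == 0 and cwnd > ssthresh:
            cwnd -= 1
    return cwnd


# Entering CWR with cwnd = 98 gives ssthresh = 49, as in the text.
cwnd, ssthresh = enter_cwr(98, 90)
print(cwnd, ssthresh)
# Ten ACKs later, cwnd has dropped by five packets.
print(cwr_rate_halving(cwnd, ssthresh, 10))
```

The model omits the additional clamp to the flight size estimate that Linux applies on each ACK while in CWR.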

The reduction in sending rate can also be observed as follows: Looking at
Figure 16-6, before time 5.5 the slope of the line gives an effective data rate of
approximately 500Kb/s. This is higher than the capacity of the link in the direction of the
data transfer, so this extra apparent capacity is the result of one or more queues
being filled up in the path, leading to an increased RTT because of queuing delay.
We can use the Statistics | TCP Stream Graph | Round Trip Time Graph to
visualize this effect (see Figure 16-8).

In this figure, the y-axis represents the estimated RTT in seconds and the x-axis 
represents the sequence number. We can see that at approximately sequence num¬ 
ber 340000, the RTT begins to decrease. This sequence number corresponds closely 
to the last sequence number sent before the pause described earlier (344400). The 
decreasing RTT corresponds to the fact that as the sender slows down, the net¬ 
work is becoming less loaded (i.e., the rate at which data is draining from the net¬ 
work exceeds the rate at which new traffic is arriving). This causes queues within 
network routers to empty, leading to a smaller wait time and a consequentially 
lower RTT. 

The sending rate reduction continues while TCP remains in the CWR state.
Eventually, if this continued, the RTT would decrease to its bare minimum value
of about 17ms. In general, TCP avoids allowing this to happen because it wants to
"keep the pipe full" to ensure that it is using the maximum amount of network
capacity currently available to it.


Figure 16-8 The sender's estimated connection round-trip time. Periods of increasing RTT (dense
groupings of increasing values) correspond to buffers filling because of an excess of
sending rate over forwarding rate at a router along the path. Decreasing RTTs represent
the opposite effect, resulting from the sender slowing down and the queues draining.

16.5.3 Stretch ACKs and Recovery from Local Congestion 

At time 8.364, following the gradual reduction in cwnd initially caused by the TCP
entering the CWR state, the TCP appears to start decreasing more quickly. This is
a consequence of a change in the relationship of cwnd and the amount of outstanding
data indicated by the ACK at time 8.362 (highlighted in Figure 16-9).

The ACK at time 8.362 is for sequence number 317801, but the previously
received ACK is for sequence number 313601, meaning this new ACK is for 317,801
- 313,601 = 4200 bytes (three packets). This is commonly called a stretch ACK,
meaning it ACKs more than twice the largest segment sent so far. It could be caused by
a number of possibilities, the simplest of which is a lost ACK. It is usually difficult
to determine with certainty the cause of the stretch ACK, but the precise reason
is not usually important. In this example, we can assume that an earlier ACK was
lost and continue to investigate how the sender behaves. Its arrival causes cwnd to
drop from 68 to 66.


Figure 16-9 A "stretch ACK" acknowledges three packets' worth of sequence numbers. Such ACKs
can cause the sender to act in a bursty manner and can occur when other ACKs are lost
in transit.
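The definition of a stretch ACK given above translates directly into a simple test on consecutive ACK numbers. The Python sketch below uses a hypothetical function name and assumes the largest segment sent so far is 1400 bytes, as in this trace.

```python
def is_stretch_ack(previous_ack, new_ack, largest_segment=1400):
    """True if the new ACK covers more than twice the largest segment sent."""
    return (new_ack - previous_ack) > 2 * largest_segment


# The ACK at time 8.362 covers 4200 bytes, so it qualifies.
print(is_stretch_ack(313601, 317801))
# An ordinary delayed ACK covering exactly two segments does not.
print(is_stretch_ack(313601, 316401))
```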

The Linux TCP implementation attempts to revise its estimate of the number
of outstanding packets whenever it receives an ACK. (It also attempts to validate
the congestion window whenever it sends segments, according to the Congestion
Window Validation algorithm described previously, but this does not have
an effect here.) When in CWR state, if the outstanding packet count estimate is
reduced for some reason, as it is here after receiving the stretch ACK, cwnd is
adjusted to be the estimate plus 1. Note that this is in addition to its ordinary
behavior in CWR, where it reduces cwnd by 1 for each pair of ACKs received.
Generally, cwnd is reduced by either 1 or 0 for each ACK, and then cwnd is set to
min(flight size + 1, [possibly reduced] cwnd). The CWR state remains operating
until cwnd reaches ssthresh or some other event, such as a loss and retransmission,
occurs.

Prior to receiving the stretch ACK, at time 8.258, 407,401 + 1400 - 313,601 =
95,200 bytes (68 packets) are outstanding. After the stretch ACK is received, the
number of outstanding packets is reduced to 65 and cwnd is set to 66.

Because the flight size estimate and cwnd are closely coupled in the CWR
state, and the TCP receiver in this example delays ACKs, the result of a pair of
ACKs arriving is to reduce cwnd by 2 and to liberate one packet. The reason for
this is as follows: Assume that before the arrival of any ACKs, cwnd is $c_0$ and the
flight size estimate is $f_0 = c_0$. When the first ACK arrives (i.e., for one packet),
$f_1 = f_0 - 1$ and cwnd is updated to $c_1 = \min(c_0 - 1, f_1 + 1) = c_0 - 1$. When the
second ACK arrives (for two packets, because of delayed ACKs), $f_2 = f_1 - 2 = f_0 - 3$
and cwnd is set to $c_2 = \min(c_1, f_2 + 1) = \min(c_0 - 1, c_0 - 2) = c_0 - 2$. Because the
congestion window has shrunk by two packets, but three packets have been ACKed
during this period, a single packet is liberated after the receipt of the second ACK.
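The two-ACK derivation can be replayed numerically. The following Python sketch is a simplified model of the update rule stated in the text, not the kernel implementation; the function name and the halve_now flag (marking which ACK of each pair carries the rate-halving decrement) are assumptions made for illustration.

```python
def cwr_on_ack(cwnd, flight, acked_packets, halve_now):
    """Process one ACK in CWR.

    Returns (cwnd, flight, liberated), where flight is the estimate before
    any newly liberated packets are sent.
    """
    flight -= acked_packets            # the ACK removes packets from the network
    if halve_now:
        cwnd -= 1                      # rate halving: one packet per two ACKs
    cwnd = min(cwnd, flight + 1)       # keep cwnd within one packet of flight size
    liberated = max(cwnd - flight, 0)  # packets the sender may now transmit
    return cwnd, flight, liberated


# Start with cwnd equal to the flight size, 68 packets.
first = cwr_on_ack(68, 68, 1, True)              # ACK for one packet
second = cwr_on_ack(first[0], first[1], 2, False)  # delayed ACK for two packets
print(first, second)
```

Across the pair, cwnd falls from 68 to 66 while three packets are ACKed, so exactly one new packet can be sent after the second ACK, in agreement with the derivation.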

The sender exits the CWR state at time 9.37 when cwnd reaches ssthresh at 49 
packets. TCP now returns to normal behavior and continues in congestion avoid¬ 
ance (see Figures 16-10 and 16-11). 



Figure 16-10 By time 9.369, the sender reverts to normal and sends either one or two packets per 
received ACK. 


In Figure 16-10, the circled packets indicate where the sender's state changes 
from CWR back to normal, where the congestion avoidance algorithm takes over. 
Figure 16-11 shows this behavior in more detail. 

The sender continues in congestion avoidance, achieving relatively stable
throughput until time 17.232. At this point, severe network congestion begins to
form, contributing to a large increase in the RTT. In Figure 16-8, this happens at
sequence number 720000, where the RTT grows to about 6.5s, a more than threefold
increase from its previously stable value of about 2s. This effect is common
with the onset of severe congestion. Eventually, the network congestion is
sufficiently severe so as to cause a packet to be dropped. The sending TCP responds
with its first retransmission.


Figure 16-11 TCP has completed its recovery and is back in the normal (congestion avoidance) state.
It sends one or two packets for each ACK received.

16.5.4 Fast Retransmission and SACK Recovery (Event 2) 

At time 21.209, after the dramatic increase in measured RTT, we observe the first 
retransmission. We can see this in more detail by zooming in as shown in Figure 16-12. 
The first retransmission (circled) is for the packet starting with sequence number 
690201, matching the highest ACK received so far (also 690201). It is triggered by 
the receipt of a single duplicate ACK carrying the SACK block [698601,700001]. 
Recall that these numbers indicate the sequence number range already received at 
the receiver. In this case, it is a single packet. 

Figure 16-12 The first retransmission (circled) occurs at time 21.209. SACK blocks are used to guide
the sender as to what packets to retransmit. Eight retransmissions in total occur
between times 21.0 and 22.0.


At time 21.209, when the retransmission takes place, the largest sequence
number sent so far is 761601 + 1400 - 1 = 763000, and cwnd is 52. In conjunction
with this fast retransmit, ssthresh is reduced from 49 to 26, and TCP enters the
Recovery state. This TCP remains in Recovery state until it receives a cumulative
ACK for the recovery point: sequence number 763000 (or higher). In addition,
cwnd is reduced to (flight size + 1) packets. However, because data has likely been
lost, determining the flight size is not so straightforward. It is accomplished using
the following relationship:

flight size = packets_outstanding + packets_retransmitted - packets_removed 

The first term on the right-hand side represents all the packets sent once by
the sender and not yet ACKed with the regular TCP cumulative ACK field. The
second term represents any that have been resent (and not ACKed), and the final
term represents any packets that are no longer in the network but also have not
been ACKed by the basic TCP cumulative ACK. The value of packets_removed must
be estimated because TCP has no reliable way to directly learn it. It represents the
sum of any (out-of-order) packets cached at the receiver plus any packets that have
been lost in the network. With SACK, it is possible to learn the number of packets
cached at the receiver, but the number of lost packets must still be estimated.
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The value of packets_outstanding here is (763,001 - 690,201)/1400 = 72,800/1400 
= 52 and the number of packets cached in the receiver is (700,001 - 698,601)/1400 
= 1400/1400 = 1 (derived from the sequence numbers in the SACK block). With 
FACK enabled, as it is here by default, holes in the receiver inferred by SACK
information are considered to be lost. Thus, in this case, TCP estimates that 698,601 -
690,201 = 8400 (6 packets) have been lost. The flight size is therefore 52 + 1 - (1 + 6) 
= 46 packets, and cwnd is set to 47. While in the Recovery state, TCP reduces cwnd 
by one packet for every two packets it receives, similar to the CWR state. After the 
first retransmission, another seven retransmissions take place, followed by trans¬ 
mission of new data, based on SACK option data carried in each of the arriving 
ACKs between times 21.2 and 21.7 (see Figure 16-13). 
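The flight size estimate at the moment of the first fast retransmit can be reproduced with a short calculation. The Python sketch below follows the relationship given earlier and the FACK-style assumption that the hole below the SACK block is lost; the function and parameter names are illustrative, and it handles only the single-SACK-block case seen here.

```python
MSS = 1400  # bytes per full-size segment in this trace


def estimate_flight_size(snd_una, snd_nxt, sack_block, retransmitted, mss=MSS):
    """Estimate flight size in packets from one SACK block.

    snd_una is the highest cumulative ACK, snd_nxt the next sequence number
    to send, and sack_block a (left_edge, right_edge) pair.
    """
    outstanding = (snd_nxt - snd_una) // mss
    sacked = (sack_block[1] - sack_block[0]) // mss   # cached at the receiver
    lost = (sack_block[0] - snd_una) // mss           # hole assumed lost
    removed = sacked + lost
    return outstanding + retransmitted - removed


flight = estimate_flight_size(690201, 763001, (698601, 700001), retransmitted=1)
print(flight, flight + 1)  # flight size, then the new cwnd
```

This yields a flight size of 46 packets and a cwnd of 47, matching the values in the text.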

In this figure, much of the normal Wireshark information has been removed 
to more clearly see the SACK options on each ACK. By looking at the SACK 
sequence numbers (SLE and SRE), we can see that most of the time there are two 
active blocks at the receiver: [698601,700001], which holds one packet, and another 
[702801,763001] (at its largest), that grows to be 43 packets. During the recovery 
period, the general rate-halving algorithm applicable to the CWR and Recovery 
states reduces cwnd by at least one packet for every pair of ACKs received. Because 
each received ACK effectively ACKs one packet in this case (through an increase 
in the SACK block size by one packet), flight size reduces by 1, which would permit 
another packet to be sent. However, because cwnd is also reduced by 1 for every
other ACK, it takes two ACKs to liberate a new packet. Note how this differs from 
the CWR case. In that case, some ACKs provided acknowledgment for two pack¬ 
ets, whereas here only one packet is ACKed (SACKed) per arriving ACK. Thus, for 
each of the transmissions and retransmissions shown in the plot, cwnd is reduced 
by 1 after each pair of ACKs has been received. During this recovery period, over¬ 
all, cwnd shrinks from 47 to 20. 

Most ACKs containing SACK options are duplicate ACKs for sequence num¬ 
ber 690201 (44 of them), as Wireshark points out. There are five good ACKs that 
contain the SACK blocks [702801,763001] and [698601,700001]. Two more contain 
only the SACK block [702801,763001]. These good ACKs do not take the sender out 
of recovery, because their ACK numbers are all below the sequence number of the 
recovery point at 763000; they are partial ACKs, as discussed earlier. 

TCP recovers from fast retransmit at time 23.301 with the arrival of a good 
ACK equal to a sequence number (765801) larger than the recovery point. At this 
point, cwnd is 20 and ssthresh is 26, meaning TCP is in slow start. By time 23.659, 
after several round trips, cwnd reaches the value 27, TCP is in the normal operat¬ 
ing state, and the congestion avoidance algorithm takes over. This completes the 
sender's first fast retransmit recovery period. 

16.5.5 Additional Local Congestion and Fast Retransmit Events 

The next four events consist of local congestion, a fast retransmit, and two more 
local congestion episodes. They are very similar to the types of events we have 
seen already, so they are summarized here only briefly. 
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Figure 16-13 SACK recovery after fast retransmission. Packet 871 contains the first SACK option used on the connection. 
Subsequent ACKs contain SACK information until packet 950. 
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16.5.5.1 CWR Again (Event 3) 

A CWR event due to local congestion occurs at time 30.745. At this point, 1,090,601
+ 1400 - 1,051,401 = 40,600 bytes (29 packets) are outstanding, and cwnd is 31. This should
allow two additional packets to be injected, but none are, because of local conges¬ 
tion. In this particular case, cwnd is set to flight size + 1 = 30, and ssthresh is reduced 
to 15. TCP exits the CWR state when cwnd reaches ssthresh. This happens at time 
34.759, after another significant increase in the connection's RTT. 

16.5.5.2 Second Fast Retransmit (Event 4) 

At time 36.914, there is another fast retransmit when cwnd = 16. Using the basic
display from Wireshark, such retransmissions are easy to spot (see Figure 16-14).




Figure 16-14 A Linux TCP sender enters the Disorder state upon receiving a duplicate ACK or an ACK with 
SACK information. Packets arriving while in this state trigger transmissions of new data. Subse¬ 
quent duplicate ACKs (or presence of SACK information) place the sender into the Recovery state 
where retransmissions take place. 
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Here, the ACK arriving at time 36.878 (packet 1366) carries the SACK block 
[1117201,1118601] and ACK number 1110201. This places Linux TCP in the Disorder
state, where arriving packets liberate one packet each (similar to limited transmit) 
of new data. Packet 1367 is the packet liberated in this case. 

With the arrival of the ACK at time 36.912 (packet 1368), containing SACK
block [1117201,1120001] and a duplicate ACK, TCP enters the Recovery state and
triggers the fast retransmit at time 36.914 (packet 1369). The highest sequence number
sent so far is 1132601 + 1400 - 1 = 1134000. Recovery is eventually completed
at time 37.455, with the arrival of the ACK containing sequence number 1134001
(packet 1391). Note that immediately following this ACK is a window update. For
bulk data transfers such as the present example, where the receiver's window is
large relative to the bandwidth-delay product of the network, such updates are not
usually of much consequence. When we have interactive traffic, small windows,
or servers that only occasionally read from the network, these updates can become
quite important, as we saw in Chapter 15. When the first retransmission takes
place, ssthresh is reduced from 16 to 8. Eventually, when recovery completes, cwnd
= 4 and ssthresh = 8. This leaves the sender in slow start because cwnd is smaller
than ssthresh.

16.5.5.3 CWR Again (Events 5 and 6) 

After the arrival of the ACK for sequence number 1359401 at time 43.356, TCP
once again enters the CWR state because of local congestion when it tries to send
subsequent packets. This ultimately reduces ssthresh to 8 and cwnd becomes 15. A
second transmission failure, while in the CWR state, brings ssthresh down to 12.
The CWR state is exited with cwnd = 7 and ssthresh = 8.

Another round of local congestion at time 59.652 forces TCP into CWR when
cwnd = 19 and ssthresh = 10. In this case, the CWR state is interrupted by a timeout
that places TCP into the Loss state. This represents a new type of event for us to
investigate.

16.5.6 Timeouts, Retransmissions, and Undoing cwnd Changes

Although TCP keeps a retransmission timer in case fast retransmit is unable to 
repair a loss, we have not yet seen it in operation. This is fortunate, because gener¬ 
ally when a timeout occurs, the connection is experiencing significant congestion 
and performance problems. In the next portion of the trace, shown in Figure 16-15, 
we see how the sending TCP handles the situation when its retransmission timer 
expires. 

16.5.6.1 First Timeout (Event 7) 

A retransmission occurs at time 62.486 (packet 2157) for sequence number 1773801
(highlighted in Figure 16-15). Immediately prior to this, there is no evidence of
duplicate ACKs or SACKs.

No.   Time       Info
2150  60.162016  6666 > 1059 [ACK] Seq=1 Ack=1769601 Win=231616 Len=0 TSV=147590805 TSER=17152189
2151  60.164994  1059 > 6666 [ACK] Seq=1785001 Ack=1 Win=5808 Len=1400 TSV=17152952 TSER=147590805
2152  60.200646  6666 > 1059 [ACK] Seq=1 Ack=1771001 Win=233016 Len=0 TSV=147590809 TSER=17152296
2153  60.203388  1059 > 6666 [ACK] Seq=1786401 Ack=1 Win=5808 Len=1400 TSV=17152990 TSER=147590809
2154  60.325522  6666 > 1059 [ACK] Seq=1 Ack=1772401 Win=233016 Len=0 TSV=147590822 TSER=17152298
2155  60.906208  6666 > 1059 [ACK] Seq=1 Ack=1773801 Win=233016 Len=0 TSV=147590880 TSER=17152339
2156  60.910008  1059 > 6666 [ACK] Seq=1787801 Ack=1 Win=5808 Len=1400 TSV=17153696 TSER=147590880
2157  62.485688  [TCP Retransmission] 1059 > 6666 [ACK] Seq=1773801 Ack=1 Win=5808 Len=1400 TSV=17155274 TSER=147590880
2158  62.756638  6666 > 1059 [ACK] Seq=1 Ack=1775201 Win=233016 Len=0 TSV=147591065 TSER=17152514
2159  62.885469  6666 > 1059 [ACK] Seq=1 Ack=1776601 Win=233016 Len=0 TSV=147591078 TSER=17152550
2160  62.889648  1059 > 6666 [ACK] Seq=1789201 Ack=1 Win=5808 Len=1400 TSV=17155676 TSER=147591078
2161  62.890796  1059 > 6666 [ACK] Seq=1790601 Ack=1 Win=5808 Len=1400 TSV=17155676 TSER=147591078


Figure 16-15 The sender experiences its first timeout when RTO = 1.57s. In this case, the sender declares the 
timeout to be spurious and undoes the modifications it made to its congestion control state. 


In Figure 16-15, at time 62.486, about 1.58s have elapsed since the last ACK was
received, but according to Figure 16-8, the estimated RTT at this point is only about
800ms. Thus, we may conclude this retransmission to be the result of a retransmission
timer expiration. This places TCP into the Loss state, which ordinarily causes
a drastic reduction of cwnd and effectively restarts the TCP in slow start. Here,
TCP sets cwnd = 1 and ssthresh = 5, placing TCP in slow start, as expected. The
timeout also forces any stored SACK information to be discarded. However, the
receiver continues to send SACK information, so the sender can still make use of
new SACK information it receives.


Note 

TCP is supposed to “forget” its knowledge about received SACK information when 
experiencing a timeout because of the possibility that a receiver may renege on 
SACK information it provided earlier. This is suggested by [RFC2018] because 
of the (obscure) possibility that a receiver may wish to adjust its buffering so 
as to delete out-of-order data it has accumulated. Although not common, such 
behavior is permitted. When a receiver reneges, it is required to include the most
recently received data blocks in the first SACK block of ACKs generated, even if
it is discarded. Except for this block, additional blocks must cease to report data
no longer being held at the receiver. 


Most interestingly here, this congestion action is undone. As discussed earlier,
the Eifel Response Algorithm can be invoked when TCP believes a retransmission
timeout to be erroneous. In this case, it is declared erroneous because of evidence
in the timestamp. The ACK received at time 62.757 for sequence number 1775201
(packet 2158) carries a TSOPT with TSER of 17152514. However, the retransmission
has the TSV of 17155274. Because the TSER field in the ACK covering the
retransmitted segment is earlier than the retransmission, the hole the retransmission was
attempting to fill was not really a hole at all. Instead, the expiration of the
retransmission timer must have been erroneous.

By declaring the retransmission timer expiration to be erroneous and invoking
an Eifel-like response algorithm, TCP restores cwnd and ssthresh to their former
values of 10 and immediately shifts to a normal operating state. This activates
the congestion avoidance algorithm, and TCP continues without much fuss.
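The timestamp comparison behind this decision can be sketched in Python. This is a minimal illustration of the idea, not the Linux implementation: the function name is hypothetical, and the comparison uses 32-bit serial-number arithmetic on the assumption that timestamp values may wrap.

```python
def timeout_was_spurious(ack_tser, retransmit_tsv):
    """True if the ACK echoes a timestamp older than the retransmission's.

    An older echoed timestamp means the ACK was triggered by the original
    transmission, so the retransmission was unnecessary.
    """
    difference = (ack_tser - retransmit_tsv) & 0xFFFFFFFF
    return difference >= 0x80000000  # negative when viewed as a signed 32-bit value


# Event 7: the ACK echoes 17152514; the retransmission carried 17155274.
print(timeout_was_spurious(17152514, 17155274))
# An ACK echoing the retransmission's own timestamp is not evidence of a spurious timeout.
print(timeout_was_spurious(17155274, 17155274))
```

When the check succeeds, the sender can restore the cwnd and ssthresh values it saved before reacting to the timeout.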

16.5.6.2 Fast Retransmit (Event 8) 

The arrival of a duplicate ACK for sequence number 1789201 carrying SACK block
[1792001,1793401] at time 67.510 (packet 2179) places TCP into the Disorder state
once again. The largest sequence number sent so far when this state is entered
is 1806000. Additional arriving SACKs trigger entry into the Recovery state and
sending of another fast retransmit at time 67.550 for sequence number 1789201
(packet 2182). This reduces ssthresh to 5 and cwnd begins shrinking until it also
reaches 5. Recovery is complete with the arrival of an ACK at time 67.916 containing
sequence number 1806001 (packet 2197).

16.5.6.3 CWR Again (Event 9) 

There is another local congestion event at time 77.121 when cwnd = 18. This sets
ssthresh = 9 and places TCP into the CWR state once again. However, the reduction
of cwnd in the CWR state this time is interrupted early by a timeout, when cwnd
has been reduced by only 1, to 8.

16.5.6.4 Second Timeout (Event 10) 

Another retransmission timeout triggers a retransmission at time 78.515 for
sequence number 2175601 (not pictured). This sets cwnd = 1; ssthresh is still 9 and
the retransmitted segment carries the TSOPT TSV value of 17171306. As with
timeout event 7, this congestion action is also undone, by the arrival of the ACK
at time 80.093 for sequence number 2179801 (packet 2641) containing the TSOPT
TSER value of 17169948. When this happens, the flight size estimate is 2,184,001 +
1400 - 2,179,801 = 5600 bytes (four packets). If cwnd were immediately restored
to its pre-timeout condition (8), this would allow four packets to be immediately
injected into the network. Doing so is considered undesirable because it may lead
to increased chances of dropped packets because of burstiness.

To prevent this bursty behavior, this Linux TCP implementation has a congestion
window moderation procedure, which limits the maximum number of packets
generated in response to a single ACK to maxburst, with a value of 3 packets in
this example. In this case, cwnd is therefore set to (flight size + maxburst) = 4 + 3 =
7. This regulation is related to the parameter of the same name proposed for TCP
and evaluated using the NS-2 network simulator. This simulator has been used
extensively in the exploration and development of new TCP algorithms [NS2].
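
The arithmetic of this example can be reproduced with a short Python sketch. It is
a minimal illustration, assuming that moderation simply caps cwnd at the flight size
plus maxburst; the function names are hypothetical and are not taken from the
Linux source.

```python
MSS = 1400  # bytes per full-size segment in this trace


def flight_size_packets(last_seq_sent: int, last_len: int, ack_no: int) -> int:
    """Packets outstanding: bytes sent but not yet acknowledged, divided by MSS."""
    return (last_seq_sent + last_len - ack_no) // MSS


def moderate_cwnd(cwnd: int, in_flight: int, maxburst: int = 3) -> int:
    """Cap cwnd so a single ACK can release at most maxburst new packets."""
    return min(cwnd, in_flight + maxburst)


in_flight = flight_size_packets(2_184_001, 1400, 2_179_801)
restored = moderate_cwnd(cwnd=8, in_flight=in_flight)
print(in_flight, restored)
```

With four packets in flight, restoring cwnd to 8 would release four packets at once;
the cap of 4 + 3 = 7 limits the burst to three.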

16.5.6.5 Timeout and Final Recovery (Event 11) 

At time 88.929 a retransmission timer has expired and a retransmission for
sequence number 2185401 occurs, as depicted in Figure 16-16.


[Wireshark screenshot of the trace, packets 2653-2664. The selected packet 2655 is
the 1400-byte retransmission of sequence number 2185401 (TSV=17181721), sent
88.929s after the first packet of the connection and 7.779s after the previous one.
It is followed by returning ACKs and further retransmissions.]

Figure 16-16 A retransmission timer expires, initiating a timeout-based retransmission that cannot be undone. 
TCP continues in slow start. 


The expiring timer places the sender into slow start with ssthresh = 5. This
time, TCP is not able to undo the timeout, so cwnd is set to 1 and slow start
progresses. This can be seen more clearly from the flow trace (see Figure 16-17).

Figure 16-17 In Wireshark, the slow start behavior is apparent after a retransmission timeout. Each
arriving ACK liberates two or three packets.

The retransmission for sequence number 2185401 is highlighted. Following
the retransmission, we see the typical slow start behavior we saw during the
beginning of the connection, when each arriving ACK liberates two or three
packets, depending on how many packets were covered by the ACK. By time 89.434,
when cwnd has reached ssthresh at 5, TCP continues in congestion avoidance.

16.5.7 Connection Completion 

The final exchange of packets commences with the sender's transmission of a FIN
at time 99.757. Following this transmission, 13 ACKs arrive followed by the
receiver's FIN. The very last packet (a final ACK) is sent at time 100.476. This
exchange is depicted in Figure 16-18.

The largest sequence number sent is 2620801 + 640 - 1 = 2621440, equivalent
to the size of the overall transfer, 2.5MB. At time 99.757, (2,619,401 + 1400 -
2,594,201)/1400 + 1 = 20 packets are outstanding. The arrival of 13 ACKs (7 of
which ACK two packets each) covers the whole window of (2 * 7) + (13 - 7) = 20
packets. Note that the ACK arriving at time 100.474 ACKs the final two packets of
sizes 1400 and 640 bytes, respectively, together with the FIN, which consumes one
sequence number: 2,621,442 - 2,619,401 = 1400 + 640 + 1.

Figure 16-18 During the connection closing procedure, the receiver produces 13 pure ACKs to
indicate that it has received all of the data the sender has produced. The final FIN-ACK
exchange completes closure of the other half of the connection. Note that the FIN
segments contain valid ACK numbers.


This extended example illustrates most of the algorithms described so far and
includes aspects of the basic TCP algorithms (slow start, congestion avoidance),
selective acknowledgment, rate halving, as well as some newer procedures such
as spurious RTO detection. We now discuss some modifications and capabilities
that are less widespread, more speculative, or more recent. The Linux TCP stack
implements many of these procedures, but not all of them are enabled by default.
Frequently, a small change using the sysctl program is sufficient to experiment
with them. More recent versions of the Windows stack (i.e., Windows Vista and
later) also implement improvements beyond the features discussed so far.


16.6 Sharing Congestion State 

The discussion so far and the example we have just seen have focused on how a
single TCP connection adapts to congestion along the path. If other connections
between the same hosts are made later, these subsequent connections typically
have to establish their own values for ssthresh and cwnd over time as described
previously. In many cases, subsequent connections could possibly learn of these values
from earlier connections to the same hosts or from other currently active
connections to the same hosts. This idea involves sharing the congestion state across
multiple connections in the same machine. An early description in [RFC2140], entitled
"TCP Control Block Interdependence," describes how this might be accomplished.
This work notes the difference between temporal sharing (new connections share
information with others that are now CLOSED) and ensemble sharing (new
connections share state with other active connections).

In an effort to generalize this idea and extend it to protocols and applications
other than TCP, [RFC3124] describes the Congestion Manager, which provides
a local operating system service available to protocol implementations to learn
information such as path loss rate, estimated congestion, RTT, and so forth to
destination hosts.

In Linux, this idea is made available in the same subsystem that contains routing
information and is known as destination metrics, which we saw in Chapter
15. These metrics are enabled by default (but they were disabled for the extended
example by setting the sysctl variable net.ipv4.tcp_no_metrics_save to 1). When
a TCP connection goes to the CLOSED state, the following information is saved:
RTT measurements (srtt and rttvar), an estimate of reordering, and the congestion
control variables cwnd and ssthresh. These are used when new connections to the
same destination start to help initialize the corresponding measurements.


16.7 TCP Friendliness 

TCP being the dominant transport protocol on the Internet, it is commonplace for
several TCP connections to be sharing one or more routers along their delivery
paths. While they do not always share bandwidth equally in such circumstances,
they do at least react to the dynamics of other TCP connections as they come and
go over time. This is not guaranteed to be the case, however, when TCP competes
for bandwidth with other (non-TCP) protocols, or when it competes with a TCP
using some alternative set of controls on its congestion window.

To provide a guideline for protocol designers to avoid unfairly competing with
TCP flows when operating cooperatively on the Internet, researchers have
developed an equation-based rate control limit that provides a bound of the bandwidth
used by a conventional TCP connection operating in a particular environment.
This method is called TCP Friendly Rate Control (TFRC) [RFC5348][FHPW00]. It
is designed to provide a sending rate limit based on a combination of connection
parameters and environmental factors such as RTT and packet drop rate. It
also gives a more stable bandwidth utilization profile than conventional TCP, so it
is expected to be appropriate for streaming applications that use moderately large
packets (e.g., video transfer). TFRC uses the following equation to determine a
sending rate:

$$X = \frac{s}{R\sqrt{2bp/3} + 3p\,t_{RTO}\,(1 + 32p^2)\sqrt{3bp/8}} \qquad [2]$$

Here, X is the throughput rate limit (bytes/second), s is the packet size (bytes,
excluding headers), R is the RTT (seconds), p is the number of loss events as a
fraction of packets sent [0,1.0], $t_{RTO}$ is the retransmission timeout (seconds), and b is the
maximum number of packets acknowledged by a single ACK. The value of $t_{RTO}$ is
recommended to be 4R, and the recommended value of b is 1.
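
To get a feel for equation [2], it can be evaluated directly. The following Python
sketch is a minimal illustration: the function name tfrc_rate and the sample
inputs (1460-byte packets, a 100ms RTT, and a 1% loss event rate) are assumptions
chosen for the example, not values from the text.

```python
from math import sqrt
from typing import Optional


def tfrc_rate(s: float, rtt: float, p: float, b: float = 1.0,
              t_rto: Optional[float] = None) -> float:
    """Throughput limit X (bytes/second) from equation [2]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("loss event rate p must be in (0, 1]")
    if t_rto is None:
        t_rto = 4.0 * rtt  # recommended setting for the timeout term
    congestion_avoidance_term = rtt * sqrt(2.0 * b * p / 3.0)
    timeout_term = 3.0 * p * t_rto * (1.0 + 32.0 * p * p) * sqrt(3.0 * b * p / 8.0)
    return s / (congestion_avoidance_term + timeout_term)


rate = tfrc_rate(s=1460, rtt=0.100, p=0.01)
print(f"{rate:.0f} bytes/s ({rate * 8 / 1e6:.2f} Mb/s)")
```

Lowering p or R raises the permitted rate, which matches the intuition that a
conventional TCP would also send faster on a shorter, less lossy path.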

The TCP sending rate can be expressed another way, based on how it adjusts
its window in response to receiving a good ACK during congestion avoidance.
Recall from the earlier discussion that standard TCP, when using the congestion
avoidance algorithm, increases cwnd by an additive amount of 1/cwnd for each
arriving good ACK and decreases it by a multiplicative factor of one-half on a loss
event. This is called additive increase/multiplicative decrease (AIMD) congestion
control, and we can produce a generalized AIMD congestion avoidance equation by
replacing the constants 1 and 1/2 with variables a and b as follows:

$$\mathit{cwnd}_{t+1} = \mathit{cwnd}_t + a/\mathit{cwnd}_t$$

$$\mathit{cwnd}_{t+1} = \mathit{cwnd}_t - b \cdot \mathit{cwnd}_t$$

Based on results from [FHPW00], this equation gives TCP the following sending
rate, in packets per RTT:

$$T = \sqrt{\frac{a(2 - b)}{2bp}} \qquad [3]$$


For regular TCP, where a = 1 and b = 0.5, this simplifies to $T = 1.2/\sqrt{p}$, known
as the simplified standard TCP response function. It relates the speed of TCP
(regulation of cwnd) to the packet drop rate the TCP experiences, without accounting for
retransmission timeouts. When TCP is not limited by other factors (sender's or
receiver's buffers, window scaling, etc.), this relationship governs TCP's
performance in benign operating environments.
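
A quick numerical check of equation [3] shows where the constant 1.2 comes from.
This is a small Python sketch; the function name aimd_rate is our own.

```python
from math import sqrt


def aimd_rate(p: float, a: float = 1.0, b: float = 0.5) -> float:
    """Sending rate in packets per RTT for AIMD(a, b) at drop rate p (equation [3])."""
    return sqrt(a * (2.0 - b) / (2.0 * b * p))


# For a = 1 and b = 0.5 the factor under the root is 1.5/p, and sqrt(1.5) is about 1.22.
for p in (1e-2, 1e-3, 1e-4):
    print(f"p = {p:g}: {aimd_rate(p):.1f} packets per RTT")
```

Passing other values of a and b gives the rate of a modified AIMD scheme at the same
drop rate, which is the quantity compared when assessing fairness against standard
TCP.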

Any alteration to TCP's response function obviously affects the way it (or
another protocol implementing a similar congestion control scheme) competes
with standard TCP. Therefore, new proposed congestion control schemes are
typically analyzed using a measure of relative fairness. Relative fairness gives the ratio
of the speed of the protocol using a modified congestion control scheme relative
to standard TCP, as a function of the packet drop rate. This is a strong indicator of
how fair any such modified schemes are with respect to sharing bandwidth across
a common Internet path.

Note that understanding these equations is only the first step in creating a
speed regulation regime that competes fairly with standard TCP. The details of
implementing TFRC for any particular protocol can be subtle and include how to
correctly measure the RTT, loss event rate, and packet size. These issues are
discussed in some detail in [RFC5348].


16.8 TCP in High-Speed Environments 

In high-speed networks with large BDPs (e.g., WANs of 1Gb/s or more),
conventional TCP may not perform well because its window increase algorithm (the
congestion avoidance algorithm, in particular) takes a long time to grow the window
large enough to saturate the network path. Said another way, TCP can fail to take
advantage of fast networks even when no congestion is present. This issue arises
primarily from the fixed additive increase behavior of congestion avoidance. If
we consider a TCP using 1500-byte packets operating over a 10Gb/s long-distance
link, some 83,000 segments are required to be outstanding in order to fully
utilize the available bandwidth, assuming no packet drops or errors in five billion
packets. For an RTT of 100ms, this takes about 1.5 hours to achieve. In order to
address this deficiency, a number of researchers and developers have explored
ways to alter TCP in order for it to perform better in such networks, while
retaining a degree of fairness to standard TCP, especially for more common lower-speed
environments.
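
The window size quoted above follows from the bandwidth-delay product, and the
drop rate such a window can tolerate follows from inverting the simplified response
function. The Python sketch below works through both; the function names are
hypothetical, and the result is an order-of-magnitude check rather than an exact
reproduction of the figures in the text.

```python
def bdp_segments(link_bps: float, rtt_s: float, packet_bytes: int) -> float:
    """Segments that must be in flight to fill a path of the given rate and RTT."""
    return link_bps * rtt_s / (packet_bytes * 8)


def drop_rate_for_window(w: float) -> float:
    """Invert T = sqrt(1.5 / p), the simplified response function, to obtain p."""
    return 1.5 / (w * w)


window = bdp_segments(link_bps=10e9, rtt_s=0.100, packet_bytes=1500)
p = drop_rate_for_window(window)
print(f"window: {window:.0f} segments")
print(f"sustainable drop rate: {p:.2e}, or one drop per {1 / p:.2e} packets")
```

A drop rate this low, on the order of one loss in billions of packets, is rarely
achievable in practice, which is why standard congestion avoidance struggles to keep
such a path full.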

16.8.1 HighSpeed TCP (HSTCP) and Limited Slow Start

The experimental HighSpeed TCP (HSTCP) specifications [RFC3649][RFC3742]
propose to alter the standard TCP behavior when the congestion window is larger
than a base value Low_Window, suggested to be 38 MSS-size segments. This value
corresponds to a packet drop rate of $10^{-3}$ based on the simplified TCP response
function given previously. This function is linear on a log-log plot of sending rate
versus packet loss rate, so it is really a power law function.


Note 

Functions that form a line on a log-log plot are called power law functions. They
have equations of the form $y = ax^k$, meaning log y = log a + k log x (a and k are
constants). This equation forms a line with slope k on a log-log plot.


To construct the type of power law function required, we select two points
and create the equation that describes the line passing between them. Consider
two such points as $(p_1, w_1)$ and $(p_0, w_0)$ where $w_1 > w_0 > 0$ and $0 < p_1 < p_0$. On a
linear plot, this would form a line with slope $(w_1 - w_0)/(p_1 - p_0)$, but on a log-log plot
it forms a line with slope $S = (\log w_1 - \log w_0)/(\log p_1 - \log p_0)$. Then, based on the
equation in the Note, we have $w = Cp^S$, and we require some point, say $(p_0, w_0)$, to
determine C. After some algebra, we find that $C = p_0^{-S} w_0$, meaning $w = p^S p_0^{-S} w_0$.
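
The construction can be checked numerically. In the Python sketch below, the
function names are our own, and the two endpoints in the demonstration (38 segments
at a drop rate of 10^-3 and 83,000 segments at 10^-7) are illustrative values of the
kind used in [RFC3649]; they are not the point and slope used for the figure that
follows.

```python
from math import log10


def fit_power_law(p0: float, w0: float, p1: float, w1: float):
    """Return (S, C) such that w = C * p**S passes through both points."""
    s = (log10(w1) - log10(w0)) / (log10(p1) - log10(p0))
    c = w0 * p0 ** (-s)
    return s, c


def response_window(p: float, s: float, c: float) -> float:
    """Evaluate the power law response function w = C * p**S."""
    return c * p ** s


s, c = fit_power_law(p0=1e-3, w0=38, p1=1e-7, w1=83_000)
print(f"S = {s:.3f}, C = {c:.3f}")
print(f"w(1e-5) = {response_window(1e-5, s, c):.0f} segments")
```

Because S is negative, the window grows as the drop rate falls, and it does so faster
than the slope of -0.5 implied by the standard response function.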

In Figure 16-19, we see a plot of both the conventional TCP response function
and a proposed response function for HSTCP based on the point $(p_0, w_0)$ =
(.0015, 31) and S = -0.82. Note that for larger packet drop rates (over about .001)
the response functions are the same, so these equations apply only for a certain
maximum value of p. Comparing the two lines, when the packet drop rate is small
enough, HSTCP is allowed to send more aggressively.



Figure 16-19 With HighSpeed TCP, the TCP response function is altered to be more aggressive
for low packet drop rates and large windows, leading to higher throughputs for
high bandwidth-delay-product networks. Image from presentation by Sally Floyd to IETF
TSVWG, Mar. 2003.


To have TCP achieve this response function, the congestion avoidance procedure
is modified to take into account the current size of the window when making
changes. This takes place, as with conventional TCP, upon the arrival of a good
ACK. The response for a good arriving ACK is generalized as follows:

$$\mathit{cwnd}_{t+1} = \mathit{cwnd}_t + a(\mathit{cwnd}_t)/\mathit{cwnd}_t$$

When responding to a congestion event (e.g., packet loss, ECN indication), it
responds as follows:

$$\mathit{cwnd}_{t+1} = \mathit{cwnd}_t - b(\mathit{cwnd}_t) \cdot \mathit{cwnd}_t$$






Here, a() is the additive increase function and b() is the multiplicative decrease
function. In this generalization of standard TCP, they are functions of the current
window size. To achieve the desired response function, we start by generalizing
from equation [3]:

$$w_0 = \sqrt{\frac{a(w)\,(2 - b(w))}{2\,b(w)\,p_0}}$$

This gives:

$$a(w) = \frac{2\,p_0\,w_0^2\,b(w)}{2 - b(w)}$$

This relationship does not have a unique solution; that is, there are many
combinations of a() and b() that satisfy the relationship, even though some of
them may not be practical or desirable for deployment.
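
The lack of a unique solution is easy to see numerically: any choice of decrease
factor yields a matching increase. The Python sketch below is illustrative only; the
function name and the sample decrease factors are assumptions, and the operating
point is taken from the simplified standard response function at a window of 38.

```python
def additive_increase(w0: float, p0: float, b: float) -> float:
    """Increase a that, paired with decrease factor b, sustains window w0 at drop rate p0."""
    return 2.0 * p0 * w0 * w0 * b / (2.0 - b)


w0 = 38.0
p0 = 1.5 / (w0 * w0)  # drop rate at which standard TCP sustains w0 segments

for b in (0.5, 0.3, 0.1):
    print(f"b = {b}: a = {additive_increase(w0, p0, b):.3f}")
```

With b = 0.5 the result is a = 1, recovering standard TCP. Gentler decreases pair
with smaller increases to hold the same operating point.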

Additional details of the changes proposed to the congestion avoidance
procedure for TCP suggested by HSTCP are available in [RFC3649]. A companion
document [RFC3742] describes how slow start can be modified to help TCP obtain
a working congestion window in such environments. This is called limited slow
start and is designed to slow down slow start, so that a TCP operating with large
windows (thousands or tens of thousands of packets) does not double its window
in one RTT.

With limited slow start, a new parameter called max_ssthresh is introduced.
This value is not the maximum value of ssthresh but instead a threshold for cwnd
that works as follows: If cwnd <= max_ssthresh, slow start proceeds as normal. If
max_ssthresh < cwnd <= ssthresh, then cwnd is increased by at most (max_ssthresh/2)
SMSS per RTT. This is accomplished by modifying the management of cwnd
during slow start as follows:


if (cwnd <= max_ssthresh) {
    cwnd = cwnd + SMSS                        (regular slow start)
} else {
    K = int(cwnd / (0.5 * max_ssthresh))
    cwnd = cwnd + int((1/K) * SMSS)           (limited slow start)
}


A suggested possible initial value for max_ssthresh is 100 packets, or 100*SMSS
in bytes.
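
The per-ACK rule for limited slow start translates almost directly into runnable
form. The Python sketch below keeps cwnd and max_ssthresh in bytes; the function
name and the 1000-byte SMSS are assumptions made to keep the numbers easy to read.

```python
def limited_slow_start_ack(cwnd: int, smss: int, max_ssthresh: int) -> int:
    """Return the new cwnd (bytes) after one good ACK during slow start."""
    if cwnd <= max_ssthresh:
        return cwnd + smss                    # regular slow start
    k = int(cwnd / (0.5 * max_ssthresh))      # k >= 2 once cwnd > max_ssthresh
    return cwnd + int(smss / k)               # limited slow start


SMSS = 1000
MAX_SSTHRESH = 100 * SMSS

for cwnd in (50 * SMSS, 150 * SMSS, 200 * SMSS, 1000 * SMSS):
    gain = limited_slow_start_ack(cwnd, SMSS, MAX_SSTHRESH) - cwnd
    print(f"cwnd = {cwnd // SMSS:4d} segments: +{gain} bytes per ACK")
```

As the window grows, K grows with it, so each ACK contributes a smaller increment
and the growth per RTT stays near max_ssthresh/2 segments instead of doubling.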

16.8.2 Binary Increase Congestion Control (BIC and CUBIC) 

HSTCP is one of several proposals for modifying TCP to provide higher
throughput for large BDP networks. While it considers throughput and fairness with
respect to conventional TCPs in similar circumstances, and elects to be more
aggressive than standard TCP under certain circumstances, it does not attempt
to directly control what happens when HSTCP connections with differing RTTs
compete with each other (called "RTT fairness"). This was studied for standard
TCP some years back, revealing that TCPs with shorter RTTs obtain a larger share
of the bandwidth on shared links as compared to those having larger RTTs, when
using the same packet size and ACK strategy [F91]. For TCPs that increase cwnd
as a function of its size (so-called bandwidth-scalable TCPs), this unfairness can be
even more severe. Whether RTT fairness should be considered desirable is
subject to debate. Although RTT fairness would seem attractive from first principles,
connections with larger RTTs are likely to be using more network resources (e.g.,
passing through more routers), so it may be reasonable for them to receive
somewhat less throughput. In any case, knowing just how RTT (un)fairness behaves is
a driving factor behind the popular TCP variants we explore next.

16.8.2.1 BIC-TCP 

In an effort to create a scalable TCP and deal with the issue of RTT fairness,
BIC-TCP (formerly called BI-TCP) [XHR04] was developed and deployed in Linux
kernels starting with version 2.6.8. The main goal of BIC-TCP is to provide linear RTT
fairness even though congestion windows may be quite large (which is required
to use high-bandwidth links). Linear RTT fairness means that connections receive
a bandwidth share inversely proportional to their RTTs, rather than some more
complicated or unknown function.

The approach modifies a standard TCP sender with two algorithms: binary
search increase and additive increase. These algorithms are invoked after a
congestion indication (e.g., packet loss), but only one of the algorithms is in operation at
any given point in time. The binary search increase algorithm operates as follows:
The current minimum window is the last point at which the connection experienced
no packet loss during an entire RTT. The maximum window is the window size at
which the connection last experienced loss, if known. The desired window lies
somewhere between the two. Using a binary search technique, BIC-TCP selects a
trial window at the midpoint of these two values and tries again recursively. If this
point shows continued packet loss, it becomes the new maximum and the process
repeats. If not, it becomes the new minimum and the process repeats. The process
terminates when the difference between the minimum and maximum windows is
less than a predefined threshold called the minimum increment, or $S_{min}$.

The algorithm tends to find the desirable window, also called the saturation
point, in a logarithmic number of trials, whereas a standard TCP would require
a linear number (half of the difference in window sizes, on average). Thus, this
approach makes BIC-TCP more aggressive than standard TCP during certain
periods of operation, but this is desired in order to take advantage of high-speed
environments without unwanted delay. The protocol is unusual, relative to other
proposals, because its increase function is concave at some points; that is, its
increase gets smaller as it gets closer to the saturation point. Most other algorithms
use large change increments nearest the saturation point.

The additive increase algorithm works as follows: When using binary search
increase, the situation can arise where the distance from the current window size
to the midpoint (in the sense of the binary search described previously) is large.
Increasing the window to the midpoint in one RTT may be ill advised because of
the potential for injecting large packet bursts into the network. This is prevented
by the additive increase algorithm, which is invoked when the distance to the
midpoint from the current window is more than some amount $S_{max}$. When this
happens, the increment is limited to $S_{max}$ per RTT, called window clamping. Once the
midpoint is closer than $S_{max}$ to the trial window, binary search increase takes over.
Overall, upon detection of a loss, the window is reduced by a multiplicative factor
β, and its growth starts again with additive increase and switches to binary search
once the desired increase amount is less than $S_{max}$. The authors call the combined
algorithms binary increase, or BI.
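
The interplay of the two algorithms can be sketched as a per-RTT increment rule.
The Python below is a simplified illustration under stated assumptions: the current
window plays the role of the minimum, the step toward the midpoint is clamped to the
range [S_min, S_max], and max probing is left out. The function names and sample
parameters are our own, and this is not the Linux implementation.

```python
def bic_increment(cwnd: float, w_max: float, s_min: float, s_max: float) -> float:
    """Window growth for one RTT while cwnd is still below w_max."""
    step = (w_max - cwnd) / 2.0           # distance to the binary-search midpoint
    return max(s_min, min(step, s_max))   # large steps are clamped to s_max


def grow(cwnd: float, w_max: float, s_min: float = 0.01,
         s_max: float = 32.0, rtts: int = 8) -> list:
    """Trace cwnd over several loss-free RTTs after a window reduction."""
    trace = [cwnd]
    for _ in range(rtts):
        if w_max - cwnd < s_min:          # search has converged
            break
        cwnd += bic_increment(cwnd, w_max, s_min, s_max)
        trace.append(cwnd)
    return trace


print(grow(cwnd=100.0, w_max=200.0))
```

The first two steps are clamped at S_max (additive increase); after that each step
halves the remaining distance, so growth slows as the window approaches the previous
maximum.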

When the window grows beyond the current maximum, or no maximum
is yet known because no loss event has occurred, a new maximum must be established.
This is accomplished by a procedure known as max probing. The purpose of max probing
is to use bandwidth when it becomes available. It proceeds in a way symmetric to
the additive increase and binary increase algorithms. It starts in small initial
increments, followed by larger increments if no congestion is indicated. The approach
shows good stability because small changes are made near the saturation point,
where the network is believed to be operating near its greatest capacity.

Linux (kernels 2.6.8 through 2.6.17) includes an implementation of BIC-TCP
that is enabled by default. Four sysctl parameters control its operation:
net.ipv4.tcp_bic, net.ipv4.tcp_bic_beta, net.ipv4.tcp_bic_low_window,
and net.ipv4.tcp_bic_fast_convergence. The first Boolean variable
controls whether BIC is used (as opposed to the conventional fast retransmit/
recovery procedures). The next contains a scaling factor for cwnd to determine
β (default 819). The next parameter controls the minimum size of the congestion
window before the BIC-TCP control algorithms take over. Its default value
is 14, meaning that for small window values standard TCP congestion control is
used. The last parameter is a flag, enabled by default. When set, it affects the way
the new maximum and target windows are selected when the binary increase
algorithm is in a downward trend. During a window reduction, the new maximum
and minimum windows are set to the current and scaled (down by a factor
of beta) values of cwnd, respectively. If fast convergence is enabled and the value
of the new maximum is less than its previous value before it was set to cwnd, the
value of the maximum window is further reduced to the average of it and
the minimum window. After this, whether or not fast convergence is enabled, the
target window is the average of the maximum and minimum values. This helps to
achieve even bandwidth sharing more quickly when multiple BIC-TCP flows are
sharing the same router.

16.8.2.2 CUBIC

The authors of BIC-TCP revised their basic algorithms to form a new control
algorithm called CUBIC [HRX08]. It has been the default congestion control algorithm
used in Linux TCP since kernel version 2.6.18. It addresses concerns raised that
BIC-TCP may be too aggressive under some circumstances. It also simplifies the
window growth procedures. Instead of using a threshold ($S_{max}$) to decide when to
invoke the binary search increase versus additive increase, an odd-degree
polynomial function, in particular a cubic function, is used instead to control the window
increase function. Cubic functions can have both convex and concave portions,
meaning that they can grow more slowly in some portions (concave) and more
quickly in others (convex). Until BIC and CUBIC, virtually all of the TCP literature
advocated convex window growth functions. The specific window growth
function, used by CUBIC to set cwnd, is as follows:

$$W(t) = C(t - K)^3 + W_{max}$$

In this equation, W(t) is the window at time t. C is a constant parameter (default
0.4), t is the elapsed time in seconds since the last window reduction, and K is the
time period the function takes to increase W to $W_{max}$ when there is no further loss
event. $W_{max}$ is the last window size prior to the last window adjustment. K can be
calculated as follows:

$$K = \sqrt[3]{\frac{W_{max}\,\beta}{C}}$$

where β is the multiplicative decrease constant (default 0.2). An illustration of the
CUBIC window growth function for K = 2.71, $W_{max}$ = 10, and C = 0.4 on the interval
t = [0,5] is shown in Figure 16-20.

This figure illustrates how the CUBIC window growth function contains both
a concave portion and convex portion. When a fast retransmit occurs, $W_{max}$ is set to
cwnd, and new values of cwnd and ssthresh are set to β*cwnd. CUBIC uses a default
value of 0.8 for β. The value W(t + RTT) gives the next target congestion window
value. When an additional ACK arrives during congestion avoidance, cwnd is
increased by (W(t + RTT) - cwnd)/cwnd.
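
The shape of the growth function is easy to tabulate. The Python sketch below uses
hypothetical function names and the parameters quoted for the illustration (a
maximum window of 10 and C = 0.4). Reproducing K = 2.71 with those values requires
the product of β and the maximum window in the K formula to equal 8, so the
demonstration passes 0.8 for that factor; treat this pairing as an assumption about
how the illustration was drawn.

```python
def cubic_k(w_max: float, beta: float, c: float = 0.4) -> float:
    """Time for W(t) to return to w_max: the cube root of w_max * beta / c."""
    return (w_max * beta / c) ** (1.0 / 3.0)


def cubic_window(t: float, w_max: float, k: float, c: float = 0.4) -> float:
    """CUBIC window growth function W(t) = C (t - K)^3 + W_max."""
    return c * (t - k) ** 3 + w_max


def cubic_ack_increase(cwnd: float, t: float, rtt: float,
                       w_max: float, k: float, c: float = 0.4) -> float:
    """Per-ACK change to cwnd toward the target W(t + RTT)."""
    return (cubic_window(t + rtt, w_max, k, c) - cwnd) / cwnd


k = cubic_k(w_max=10.0, beta=0.8)
print(f"K = {k:.2f}")
for t in range(6):
    print(f"t = {t}: W(t) = {cubic_window(t, 10.0, k):.2f}")
```

The increments shrink as t approaches K (the concave portion), vanish at t = K where
W(t) equals the previous maximum, and then grow again (the convex portion).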

It is worth noting that having t be the amount of elapsed time since the last 
window reduction event helps to ensure RTT fairness. Instead of changing the 
window by some fixed amount when ACKs arrive, the window change amount is 
a function of the elapsed time since the last window change. This decouples the 
window change operations from the particular pattern of ACK arrivals. 

Figure 16-20 The CUBIC window growth function is a cubic function of t. It has a concave
portion in the area where W(t) < $W_{max}$. In this region, CUBIC searches for the saturation
point by growing cwnd with decreasing aggressiveness. After $W_{max}$ is reached, the
growth function becomes convex, where it searches by growing cwnd with increasing
aggressiveness.

In addition to the cubic operating region, CUBIC also has a "TCP-friendly"
region that operates when the window is small to ensure that CUBIC is not
penalized relative to regular TCP. More specifically, the window size of standard
TCP in terms of the elapsed time t, is given by


$$W_{tcp}(t) = W_{max}\,\beta + 3\,\frac{1 - \beta}{1 + \beta}\cdot\frac{t}{RTT}$$


So if cwnd is less than $W_{tcp}(t)$ when an ACK arrives during congestion avoidance,
CUBIC sets cwnd = $W_{tcp}(t)$. This ensures TCP friendliness in common low- to
moderate-speed networks, where CUBIC would otherwise be disadvantaged.

As mentioned earlier, CUBIC has been the default congestion control algorithm
for Linux kernels since 2.6.18. Since kernel version 2.6.13, however, Linux supports
pluggable congestion avoidance modules [P07], allowing the user to pick which
algorithm to use. The variable net.ipv4.tcp_congestion_control contains the
current default congestion control algorithm (default: cubic). The variable
net.ipv4.tcp_available_congestion_control contains the congestion control
algorithms loaded on the system (in general, additional ones can be loaded as
kernel modules). The variable net.ipv4.tcp_allowed_congestion_control
contains those algorithms permitted for use by applications (either selected
specifically or by default). The default supports CUBIC and Reno.
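
On a Linux system these variables can also be inspected without the sysctl program,
because each one appears as a file under /proc/sys with the dots in its name replaced
by directory separators. The Python sketch below relies on that layout; the helper
names are hypothetical, and on systems without these files the reader simply
returns None.

```python
from pathlib import Path
from typing import Optional


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Return the variable's current value, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


for var in ("net.ipv4.tcp_congestion_control",
            "net.ipv4.tcp_available_congestion_control",
            "net.ipv4.tcp_allowed_congestion_control"):
    print(f"{var} = {read_sysctl(var)}")
```

Reading these files does not normally require special privileges; changing them
does.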

16.9 Delay-Based Congestion Control

The approaches to congestion control we have seen so far are usually triggered by
packet loss, detected using some combination of ACKs or SACKs, ECN (if
available), and expiration of a retransmission timer. ECN (see Section 16.11) allows a
sending TCP to be informed about congestion prior to the need for the network
to drop packets, but this requires participation from routers within the network
that may not be available. However, even without ECN it is still possible to try to
determine from a host whether congestion is about to occur within the network.
One clue that congestion may be forming is an increase in measured RTT as the
sender injects more packets into the network. We saw this situation in Figure 16-8,
where additional packets were being queued rather than delivered, contributing
to a higher measured RTT (until packets were ultimately discarded). Several
congestion control techniques depend on this observation. They are called delay-based
congestion control algorithms, as opposed to the loss-based congestion control
algorithms we have seen so far.

16.9.1 Vegas 

In 1994, TCP Vegas was introduced [BP95]. It was the first delay-based congestion
control approach for TCP published and tested by the community of TCP
developers. Vegas operates by estimating the amount of data it expects to transfer in a
certain amount of time and comparing this with the amount of data it is actually
able to transfer. If the requisite amount of data is not transferred, it is likely to
be held up in a router queue along the path. If this condition persists, the Vegas
sender slows down. This is in contrast to the standard TCP approach, which forces
a packet drop to occur in order to determine the point at which the network is
congested.

While in its congestion avoidance phase, during each RTT, Vegas measures
the amount of data transferred and divides this number by the minimum delay
observed across the connection. It maintains two thresholds, α and β (where α
< β). When the difference in expected throughput (window size divided by the
smallest RTT observed) versus achieved throughput is less than α, the congestion
window is increased; when it is greater than β, the congestion window is
decreased. Otherwise, it is left as is. All changes to the congestion window are
linear, meaning the scheme is an additive increase/additive decrease (AIAD)
congestion control scheme.
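
The per-RTT decision just described can be sketched in a few lines of Python. This is only an illustration, not the Vegas implementation: the function name, the floor of 2 packets, and the step of multiplying the throughput difference by the minimum RTT (so that it can be compared with thresholds expressed in packets) are assumptions made for the example.

```python
def vegas_update(cwnd, base_rtt, rtt, alpha=1.0, beta=3.0):
    """Return the next congestion window (in packets) after one RTT.

    base_rtt is the smallest RTT observed so far; rtt is the latest sample.
    """
    expected = cwnd / base_rtt            # throughput if nothing were queued
    actual = cwnd / rtt                   # throughput actually achieved
    # Assumption: express the gap as an estimate of packets held in queues.
    queued = (expected - actual) * base_rtt
    if queued < alpha:
        return cwnd + 1                   # additive increase
    if queued > beta:
        return max(cwnd - 1, 2)           # additive decrease (floor is an assumption)
    return cwnd                           # between the thresholds: leave as is


print(vegas_update(10, 0.100, 0.100))     # no extra delay observed
print(vegas_update(10, 0.100, 0.200))     # RTT has doubled
print(vegas_update(10, 0.100, 0.125))     # modest extra delay
```

With a 100ms minimum RTT and a window of 10 packets, the three calls exercise the increase, decrease, and hold branches in turn.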

The authors describe α and β in terms of buffer utilization at a bottleneck link.
The smallest values of interest are 1 for α and 3 for β. The reasoning behind these
values is as follows: At least one packet buffer should be occupied in the network
path (i.e., at the queue in the router incident with the minimum-bandwidth link
on the path) to keep the network busy. If extra bandwidth becomes available,
occupying two additional buffers (up to 3, the value for β) obviates the need to wait
an extra RTT in order to inject more, which would be required if Vegas tried to
maintain only one buffer full. Furthermore, having the region (β - α) as the
operating range leaves some room for minor changes in throughput without causing
an immediate change in the window, a form of damping that aims to reduce rate
oscillations.

With a slight modification, this approach can also be applied to the slow start
period. Here, increasing cwnd by 1 for each good ACK is allowed only every other
RTT. For those RTTs when it is not increased, a measurement is made to ensure
that throughput is increasing. If not, the sender switches to the Vegas congestion
avoidance scheme.

Under certain circumstances, Vegas can be "fooled" into believing that the
forward-direction delay is higher than it really is. This happens when there is
significant congestion in the reverse direction (recall that the paths in the two
directions of a TCP connection may be different and have different states of
congestion). In such cases, packets (ACKs) returning to the sending TCP are delayed,
even though the sender is not really contributing to the (reverse-path) congestion.
This causes Vegas to reduce the congestion window even though such an
adjustment is not really necessary. This is a potential pitfall for most techniques based
on measuring RTT as a basis for congestion control decisions. Indeed, significant
traffic in the reverse direction can cause the ACK clock (Figure 16-1) to be
significantly perturbed [M92].

Vegas is fair relative to other Vegas TCPs sharing the same path because each
pushes the network to hold only a minimal amount of data. However, Vegas and
standard TCP flows do not share paths equally. A standard TCP sender tends
to fill queues in the network, whereas Vegas tends to keep them nearly empty.
Consequently, as the standard sender injects more packets, the Vegas sender sees
increased delay and slows down. Ultimately, this leads to an unfair bias in favor
of the standard TCP. Vegas is supported by Linux but not enabled by default. For
kernels prior to 2.6.13, the Boolean sysctl variable net.ipv4.tcp_vegas_cong_avoid
determines whether it is used (default 0). The variables
net.ipv4.tcp_vegas_alpha (default 2) and net.ipv4.tcp_vegas_beta (default
6) correspond to the alpha and beta described previously but are expressed in
half-packet units (i.e., 6 corresponds to 3 packets). The variable
net.ipv4.tcp_vegas_gamma (default 2) configures how many half-packets Vegas should
attempt to keep outstanding during slow start. For kernels after 2.6.13, Vegas must
be loaded as a separate kernel module and enabled by setting
net.ipv4.tcp_congestion_control to vegas.


16.9.2 FAST 

FAST TCP was developed with particular attention to operations in high-speed
environments with large bandwidth-delay products [WJLH06]. Similar to Vegas
in spirit, it adjusts the window based on the difference between an expected
throughput rate and an experienced rate. It differs from Vegas by adjusting the
window based not only on the window size, but also on the difference between
the current and expected performance. It updates the sending rate every other
RTT using a rate-pacing technique. If the measured delay is significantly below
a threshold, the window is updated aggressively followed by a period when the
increase is less aggressive. When the delay increases, the reverse takes place. FAST
differs from the other approaches we have discussed because it is the subject of
several patents and is being commercialized independently. It has received
somewhat less scrutiny from the research community, but an independent evaluation
[S09] has shown it to have good stability and fairness properties.

16.9.3 TCP Westwood and Westwood+ 

TCP Westwood (TCPW) and TCP Westwood+ (TCPW+) aim at handling large
bandwidth-delay-product paths by modifying a conventional TCP NewReno sender.
TCPW+ is a correction to the original TCPW algorithm, so we will just refer to
either as TCPW. In TCPW, the sender's eligible rate estimate (ERE) is an estimate of
the bandwidth available on the connection. It is continuously computed in a
fashion somewhat similar to Vegas (based upon the difference between an expected
and an achieved rate), but with a variable measurement interval for the rates
based on the dynamics of ACK arrivals. When congestion is low, the measurement
interval is small, and vice versa. When a packet loss is detected, instead of
reducing cwnd by half, TCPW computes an estimated BDP (ERE times the minimum
RTT observed) and uses this as the new value for ssthresh. Agile probing [WYSG05]
adaptively and repeatedly sets ssthresh when a connection would otherwise
operate in slow start. This causes cwnd to grow exponentially in cases where ssthresh
has been increased (by initiating slow start). Westwood can be enabled in Linux
kernels after 2.6.13 by loading a TCPW module and setting
net.ipv4.tcp_congestion_control to westwood.
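
To make the loss response concrete, the following Python sketch contrasts the TCPW choice of ssthresh (estimated BDP) with the conventional halving. It is a simplified illustration: the function names, the 1460-byte segment size, and the floor of 2 segments are assumptions, and the ERE itself is taken as an input rather than computed from ACK arrivals.

```python
def westwood_ssthresh(ere_bytes_per_sec, min_rtt_sec, mss_bytes=1460):
    """New ssthresh in segments: estimated BDP = ERE times the minimum RTT."""
    bdp_bytes = ere_bytes_per_sec * min_rtt_sec
    return max(2, int(bdp_bytes // mss_bytes))   # floor of 2 is an assumption


def halving_ssthresh(cwnd_segments):
    """Conventional response: half of the current window."""
    return max(2, cwnd_segments // 2)


# A 10Mb/s estimate (1,250,000 bytes/s) with a 100ms minimum RTT.
print(westwood_ssthresh(1_250_000, 0.100))
print(halving_ssthresh(200))
```

The point of the comparison is that the TCPW value depends on the measured path capacity, not on how large cwnd happened to be when the loss occurred.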

16.9.4 Compound TCP 

Starting with Windows Vista, it is possible to choose which congestion control 
procedure ("provider") TCP should use, in a way similar to Linux's pluggable 
congestion avoidance modules. One such option (but not the default, except for 
Windows Server 2008) is called Compound TCP (CTCP) [TSZS06]. CTCP makes 
window adjustments based upon packet loss, but also based on measured delays. 
In some sense it is a combination of standard TCP and Vegas, but with the
scalability features of HSTCP.

The authors begin by recounting a number of results shown in the Vegas and
FAST research that suggest that delay-based congestion control schemes tend to
have better utilization, less self-induced packet loss, faster convergence (to the
correct operating point), plus better RTT fairness and stabilization. However, as
mentioned previously, delay-based approaches tend to lose bandwidth when
competing with loss-based congestion control approaches. CTCP attempts to address
this situation by combining a delay-based approach with a loss-based approach.
To do this, CTCP introduces a new window control variable called dwnd (the
"delay window"). The usable window W then becomes

W = min(cwnd + dwnd, awnd)

The handling of cwnd is similar to that of standard TCP, but the addition of dwnd
may allow additional packets to be sent if the delay conditions are appropriate.
When ACKs arrive during congestion avoidance, cwnd is updated as follows:

cwnd = cwnd + 1/(cwnd + dwnd)

The management of dwnd is based on Vegas and is nonzero only during
congestion avoidance (CTCP uses conventional slow start). As a connection operates,
the minimum RTT measured is maintained in the variable baseRTT. Then, the
difference in expected data outstanding versus the actual amount, diff, is computed
as follows: diff = W * (1 - (baseRTT/RTT)), where RTT is the estimated (smoothed)
RTT. The value of diff estimates the number of packets (or bytes) queued
in the network. CTCP, like most delay-based schemes, attempts to keep diff at a
certain threshold, called γ, in order to ensure that the network remains utilized
but not congested. Given this goal, the control process for dwnd is then expressed
as follows:


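$$
dwnd(t+1) =
\begin{cases}
dwnd(t) + \left(\alpha \cdot win(t)^{k} - 1\right)^{+}, & \text{if } diff < \gamma \\
\left(dwnd(t) - \zeta \cdot diff\right)^{+}, & \text{if } diff \ge \gamma \\
\left(win(t) \cdot (1 - \beta) - cwnd/2\right)^{+}, & \text{if loss detected}
\end{cases}
$$

where (x)^+ means max(x, 0). Note that dwnd can never be negative. Rather, it may
be zero, in which case CTCP behaves like standard TCP.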

In the first case, where the network may be underutilized, CTCP grows dwnd
according to the polynomial α · win(t)^k. This is a form of binomial increase and
accounts for the way CTCP can be made more aggressive (similar to HSTCP) when
the buffer occupancy is estimated to be less than γ. In the second case, where the
buffer occupancy appears to be growing beyond the desired threshold γ, the
constant ζ dictates how quickly the delay-based component should be reduced (but
recall that dwnd is always added to cwnd). This is what contributes to CTCP's RTT
and TCP fairness. When loss is detected, dwnd has its own multiplicative decrease
factor β applied.
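
The three cases of the dwnd control process can be written out as a short Python sketch. It is an illustration of the rule as stated above, not the Windows implementation. The function names are hypothetical, win is taken to be cwnd + dwnd, and the value used for zeta is a placeholder assumption because no default is given here; the other defaults (alpha = 0.125, beta = 0.5, k = 0.75, gamma = 30) are the ones reported by the CTCP authors.

```python
def ctcp_diff(win, base_rtt, rtt):
    """Estimated packets queued in the network: win * (1 - baseRTT/RTT)."""
    return win * (1.0 - base_rtt / rtt)


def ctcp_dwnd_update(dwnd, cwnd, diff, loss=False,
                     alpha=0.125, beta=0.5, k=0.75, gamma=30.0, zeta=1.0):
    """Return dwnd(t+1) given dwnd(t), cwnd, and diff (all in packets).

    zeta=1.0 is a placeholder chosen for illustration only.
    """
    win = cwnd + dwnd                                   # assumption: win = cwnd + dwnd
    if loss:
        new = win * (1.0 - beta) - cwnd / 2.0           # loss detected
    elif diff < gamma:
        new = dwnd + max(alpha * win ** k - 1.0, 0.0)   # network underutilized
    else:
        new = dwnd - zeta * diff                        # queue growing past gamma
    return max(new, 0.0)                                # dwnd is never negative


print(ctcp_dwnd_update(0.0, 100.0, diff=5.0))           # grows
print(ctcp_dwnd_update(20.0, 100.0, diff=40.0))         # shrinks, clamped at zero
print(ctcp_dwnd_update(20.0, 100.0, diff=0.0, loss=True))
```

Because the result is clamped at zero, a connection whose delay component has been driven down simply falls back to the behavior of cwnd alone.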

As can be seen, CTCP can be tuned using the parameters k, α, β, γ, and ζ. The
value of k affects the level of aggressiveness. A value of about 0.8 was desired to
be similar to HSTCP, but 0.75 was chosen for implementation reasons. The values
of α and β affect smoothness and responsiveness. The default values are 0.125 and
0.5, respectively. For γ the authors suggest a default value of 30 packets based on
empirical evaluation. If this value is too small, there may not be enough packets
outstanding to obtain good delay measurements. Conversely, values that are too 
large could result in undesirable persistent congestion. 

CTCP is relatively new, so further experimentation and evaluation will no
doubt be performed to see how well and fairly it competes with standard TCP,
and how well it is able to adapt to significant changes in available bandwidth. In a
simulation study, the author of [W08] noted that CTCP can perform poorly when
network buffers are small (i.e., smaller than γ). They also suggest that CTCP can
fall victim to some of the problems with Vegas, including rerouting (adapting to
new paths with different delays) and persistent congestion. Finally, they observe
that if many CTCP flows, each trying to keep γ packets in flight, share the same
bottleneck link, performance can be poor.

As mentioned previously, CTCP is not enabled by default on most versions of
Windows. However, the following command can be used to select CTCP as the
congestion provider:


C:\> netsh interface tcp set global congestionprovider=ctcp 


It can be disabled by selecting a different provider (or none). CTCP has also been
ported to Linux as a pluggable congestion avoidance module but is not included
by default.


16.10 Buffer Bloat 

Although memory has traditionally been expensive (and remains so for
high-end routers), it is now commonplace to find commodity networking equipment
that contains a significant amount of memory, potentially multiple megabytes of
packet buffers. Perhaps ironically, this large amount of memory (as compared to
traditional networking devices) can actually lead to degraded performance for
protocols such as TCP. This problem has been termed buffer bloat [G11][DHGS07].
It relates to high amounts of latency introduced by queuing delay, primarily at the
uplink side of residential gateways and access points in homes and small offices.
The standard TCP congestion control algorithms, which tend to keep buffers full
at bottleneck links, do not operate well when a large amount of buffering occurs
between the sender and receiver because the congestion indicator (a packet drop)
takes a long time to be delivered to a sender.

In [KWNP10], the authors find that upload bandwidth in the United States
over cable and DSL ranges from about 256Kb/s to 4Mb/s. They also inferred
buffer sizing on commodity routers in the range from 16KB to 256KB. Figure 16-21
shows how latency relates to data rate for several buffer sizes to help provide a
perspective on these findings.

Figure 16-21 The log-log plot shows the latency due to queuing delay experienced by data in fully
congested queues of various sizes. When large buffers remain full ("buffer bloat"),
interactive applications can experience unacceptable latencies in the multiple-second
range.

In this figure, the log-log graph displays the amount of latency experienced
by data required to queue for various buffer sizes (1KB-2MB). Residential Internet
upload bandwidth rates (typically 250Kb/s to 10Mb/s) can lead to latencies in the
multiple-second range if buffers are sized to be a few hundred kilobytes or more.
Interactive applications generally require one-way delays to be below 150ms to
provide a good quality of experience to users [G114]. Thus, if buffers remain filled
to capacity because of one or more large competing uploads (e.g., BitTorrent file
sharing), interactive applications can be adversely affected.
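
The relationship plotted in the figure is simple arithmetic: a full buffer of B bytes draining over a link of R bits per second delays newly arriving data by 8B/R seconds. The short Python sketch below tabulates this for a few buffer sizes and uplink rates in the ranges mentioned above. The function name is hypothetical, and the conventions that 1KB is 1024 bytes and 1Mb/s is 1,000,000 bits/s are assumptions of the example.

```python
def queuing_delay_seconds(buffer_bytes, link_bits_per_sec):
    """Time for a full buffer to drain over the link."""
    return buffer_bytes * 8 / link_bits_per_sec


for size_kb in (16, 64, 256):
    for rate in (256_000, 1_000_000, 4_000_000):
        delay = queuing_delay_seconds(size_kb * 1024, rate)
        print(f"{size_kb:>4}KB buffer at {rate / 1e6:5.3f}Mb/s: {delay:6.3f}s")
```

A full 256KB buffer on a 1Mb/s uplink, for example, adds roughly 2 seconds of delay, far above the 150ms target for interactive applications.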

Buffer bloat is not a problem in all networking equipment. Indeed, the primary
concern appears to be in overbuffered end-user access devices. There are
multiple potential ways to deal with the issue, including protocol modifications (e.g.,
delay-based congestion control such as Vegas, but it may be negatively affected by
high jitter [DHGS07]), dynamic buffer sizing at the access devices (suggested in
[KWNP10]), or a combination of the two. We next turn to a combination approach
that may help the buffer bloat problem but also has a number of other benefits.


16.11 Active Queue Management and ECN 

The discussion of TCP's congestion response so far has assumed that the only way
a TCP infers that congestion is happening is observation of packet drops. In
particular, routers (the things that are most likely to become congested) do not
ordinarily help inform the TCP at each host that congestion is imminent. Instead, they
simply drop arriving packets when no more buffer space is available (called "drop
tail") and send packets that have already arrived in a first-in-first-out (FIFO)
manner. When Internet routers are passive like this (that is, they simply discard packets
when overloaded and provide no feedback regarding their congestion state), there
is little a TCP can do other than react after the fact. If, however, these routers had
a way to more actively manage their queues (i.e., by using a more sophisticated
scheduling and buffer management policy than FIFO/drop tail), perhaps the
situation could be improved. If they could also signal their congestion state to TCP
endpoints, so much the better.

Routers that apply scheduling and buffer management policies other than 
FIFO/drop tail are usually said to be active, and the corresponding methods they 
use to manage their queues are called active queue management (AQM) mechanisms. 
The authors of [RFC2309] provide a discussion of the potential benefits of AQM. 
Although AQM can be useful independently, it becomes more useful when
routers and switches implementing AQM have a common method for conveying their
status to the end systems. For TCP, this is described in [RFC3168] and extended 
with additional security in an experimental specification [RFC3540]. These RFCs 
describe Explicit Congestion Notification (ECN), which is a way for routers to 
mark packets (by ensuring both of the ECN bits in the IP header are set) to indicate 
the onset of congestion. 

Random Early Detection (RED) gateways [FJ93] are one mechanism suggested
as being capable of detecting the onset of congestion and controlling the
marking of packets. These gateways implement a queue management discipline that
measures the average queue occupancy over time. If the occupancy exceeds the
minimum (called minthresh) and is less than the maximum (called maxthresh), a
packet is marked with a probability that increases with the occupancy. If the
average queue occupancy exceeds maxthresh, packets are marked with a configurable
maximum probability (called MaxP), which could be 1.0. RED can also be configured
to drop packets instead of marking them.
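
The marking decision can be sketched as follows. This is a simplified illustration that follows the description above, not a router implementation: the function names are hypothetical, the linear ramp between the thresholds and the exponentially weighted moving average (with the weight shown) are the conventional choices and are assumed here, and refinements such as spacing marks evenly over time are omitted.

```python
def update_average(avg, sample, weight=0.002):
    """Exponentially weighted moving average of the queue occupancy."""
    return (1.0 - weight) * avg + weight * sample


def red_mark_probability(avg_queue, minthresh, maxthresh, max_p):
    """Probability of marking (or dropping) an arriving packet."""
    if avg_queue < minthresh:
        return 0.0                       # queue is short: never mark
    if avg_queue >= maxthresh:
        return max_p                     # at or above maxthresh: use MaxP
    # Assumed linear ramp from 0 at minthresh up to max_p at maxthresh.
    return max_p * (avg_queue - minthresh) / (maxthresh - minthresh)


for avg in (5, 10, 20, 30, 40):
    print(avg, red_mark_probability(avg, minthresh=10, maxthresh=30, max_p=0.1))
```

Using an average rather than the instantaneous queue length lets short bursts pass unmarked while sustained occupancy raises the marking rate.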


Note 

The RED algorithm is the basis for a number of variants (e.g., Cisco’s WRED, 
which uses different RED instances based on IP DSCP or precedence values) 
that are supported on many routers and switches. 


When received by a TCP, a congestion mark indicates that the packet has passed 
through a congested router. Of course, it is the sender (rather than the receiver) that
really needs this information in order to react by slowing down. Thus, the receiver 
echoes this indication back to the sender in a series of ACK packets. 

The ECN mechanism operates partially at the IP layer and so is potentially
applicable to transport protocols other than TCP, although most of the work on
ECN has been with TCP, and it is what we discuss here. When an ECN-capable
router experiencing persistent congestion receives an IP packet, it looks in the
IP header for an ECN-Capable Transport (ECT) indication (currently defined as
either of the two ECN bits in the IP header being set). If set, the transport protocol
responsible for sending the packet understands ECN. At this point, the router sets
a Congestion Experienced (CE) indication in the IP header (by setting both ECN bits to 1)
and forwards the datagram. Routers are discouraged from setting a CE indication
when congestion does not appear to be persistent (e.g., upon a single recent packet
drop due to queue overrun) because the transport protocol is supposed to react
given even a single CE indication.
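
The router's part in this exchange can be sketched in Python. The codepoint values follow [RFC3168] (two bits, with 00 meaning the transport is not ECN-capable). The function name is hypothetical, and dropping a packet that carries no ECT indication is shown as one plausible fallback, which is an assumption here rather than something stated above.

```python
# Two-bit ECN field values from [RFC3168].
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def router_forward(ecn_bits, persistent_congestion):
    """Return (action, ecn_bits) for one arriving IP packet."""
    if not persistent_congestion:
        return ("forward", ecn_bits)     # nothing to signal
    if ecn_bits == NOT_ECT:
        return ("drop", ecn_bits)        # assumption: fall back to dropping
    return ("forward", CE)               # either ECT value (or CE) becomes CE


print(router_forward(ECT_0, persistent_congestion=True))
print(router_forward(NOT_ECT, persistent_congestion=True))
print(router_forward(ECT_1, persistent_congestion=False))
```

A packet that already carries CE from an upstream router is simply forwarded with the mark intact.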

The TCP receiver observing an incoming data packet with CE set is obliged
to return this indication to the sender (there is an experimental extension to add
ECN to SYN + ACK segments as well [RFC5562]). Because the receiver normally
returns information to the sender by using (unreliable) ACK packets, there is a
significant chance that the congestion indicator could be lost. For this reason, TCP
implements a small reliability-enhancing protocol for carrying the indication back
to the sender. Upon receiving an incoming packet with CE set, the TCP receiver sets
the ECN-Echo bit field in each ACK packet it sends until receiving a CWR bit field
set to 1 from the TCP sender in a subsequent data packet. The CWR bit field being
set indicates that the congestion window (i.e., sending rate) has been reduced.
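
The receiver side of this small protocol amounts to a one-bit state machine, sketched below in Python. The class and method names are hypothetical. The order in which a packet carrying both CWR and CE is processed (clear first, then set) is an assumption of this sketch.

```python
class EcnReceiver:
    """Tracks whether outgoing ACKs should carry ECN-Echo."""

    def __init__(self):
        self.echoing = False

    def on_data(self, ce_marked, cwr_flag):
        """Process one arriving data packet."""
        if cwr_flag:
            self.echoing = False     # sender says it has reduced its window
        if ce_marked:
            self.echoing = True      # (re)start echoing on a CE mark

    def next_ack_ece(self):
        """Value of the ECN-Echo bit for the next ACK sent."""
        return self.echoing


rx = EcnReceiver()
for ce, cwr in [(False, False), (True, False), (False, False), (False, True)]:
    rx.on_data(ce, cwr)
    print(f"CE={ce!s:5} CWR={cwr!s:5} -> ECN-Echo={rx.next_ack_ece()}")
```

Because every ACK repeats the indication until CWR arrives, the loss of any individual ACK does not cause the congestion signal to be missed.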


Note 

Although RED and ECN have been known for nearly two decades, they have not 
seen widespread Internet deployment. A variety of reasons have been asserted 
as to why (e.g., difficulty in setting RED parameters, a perception of limited
benefits). In 2005, a “reexamination” of ECN [K05] pointed out that using ECN on
only data packets limits its benefits substantially. An experimental extension
[RFC5562] defines the use of ECN in SYN + ACK packets with the possibility of
greatly increasing the utility of ECN for certain workloads (e.g., Web traffic).


A sending TCP receiving an ECN-Echo indicator in an ACK reacts the same
way it would when detecting a single packet drop by adjusting cwnd, and it also
arranges to set the CWR bit field in a subsequent data packet. The prescribed
congestion response of the fast retransmit/recovery algorithms is invoked (of course,
without the packet retransmission), causing the TCP to slow down prior to
suffering packet drops. Note that the TCP should not overreact; in particular, it should
not react more than once for the same window of data. Doing so would overly
penalize an ECN TCP relative to others.
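
One way to honor the once-per-window rule is to remember the highest sequence number outstanding at the time of the reduction and to ignore further ECN-Echo indications until that point has been acknowledged. The Python sketch below illustrates the idea. The class, its field names, the halving of cwnd, and the floor of 2 segments are all assumptions made for illustration and do not describe any particular stack.

```python
class EcnSender:
    """Reacts to ECN-Echo at most once per window of data."""

    def __init__(self, cwnd, snd_nxt=0):
        self.cwnd = cwnd             # congestion window, in segments
        self.snd_nxt = snd_nxt       # next sequence number to be sent
        self.recover = None          # snd_nxt at the time of the last reduction
        self.send_cwr = False        # set CWR on the next data packet

    def on_ack(self, ack_no, ece):
        # The window that was in flight at the last reduction is now ACKed.
        if self.recover is not None and ack_no >= self.recover:
            self.recover = None
        if ece and self.recover is None:
            self.cwnd = max(self.cwnd // 2, 2)   # same response as a single loss
            self.recover = self.snd_nxt
            self.send_cwr = True


tx = EcnSender(cwnd=20, snd_nxt=1000)
tx.on_ack(500, ece=True)     # first indication: reduce
tx.on_ack(600, ece=True)     # same window: ignored
print(tx.cwnd)
```

In the example, the second ECN-Echo arrives before sequence number 1000 has been acknowledged, so the window is reduced only once.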

In Windows Vista and later, ECN needs to be enabled to be used: 


C:\> netsh int tcp set global ecncapability=enabled 


In Linux, ECN is enabled if the Boolean sysctl variable net.ipv4.tcp_ecn
is nonzero. The default varies based on which Linux distribution is used, with
off being most common. On Mac OS 10.5 and later, the variables
net.inet.tcp.ecn_initiate_out and net.inet.tcp.ecn_negotiate_in control whether
ECN is enabled for outgoing traffic and for incoming traffic with ECN flags set,
respectively. Of course, without cooperation from routers or switches, the utility
of ECN is limited in any case. Only time will tell if the vision for AQM will ever be
fully realized in the global Internet.


Note 

RED and ECN have been used successfully in a radically different operating
environment from that for which they were designed. Microsoft and Stanford have
developed Data Center TCP (DCTCP) [A10], which uses RED implemented in
layer 2 switches with simplified parameters to mark packets when instantaneous
congestion is experienced. They also modify the TCP receiver behavior to set
ECN-Echo in ACKs only when the last received packet contains a CE mark. They
report a 90% reduction in buffer occupancy for comparable TCP throughput,
allowing a tenfold increase in background traffic to be supported.


16.12 Attacks Involving TCP Congestion Control 

We have seen already how TCP can be attacked by generating packets that cause 
TCP's connection state machine to terminate the connection. TCP can also be 
attacked (or at least induced to behave in peculiar ways) when operating in the 
ESTABLISHED state. Most attacks on TCP congestion control attempt to force a 
TCP to send faster or slower than it would under ordinary circumstances. 

Perhaps the earliest attack involves the fabrication of ICMPv4 Source Quench 
messages. When these are delivered to a host running TCP, any connection to the 
IP address contained in the offending datagram inside the ICMP message slows 
down. While this may have been a vulnerability some years back, using Source 
Quench messages for congestion control has been deprecated for use by routers 
since about 1995 (via [RFC1812], Section 5.3.6). On the other hand, for end hosts,
[RFC1122] stated that a TCP must react to a Source Quench by slowing down.
Combining these two facts, the simplest solution is to block ICMP Source Quench 
traffic at the router or host, and this is now common. 

A more sophisticated and more recent set of attacks have been considered 
by looking at misbehaving receivers [SCWA99]. The authors describe three types of 
attacks that can cause a TCP sender to inject data at a rate faster than intended. 
Such attacks could be used, for example, to cause a Web client to have an unfair 
advantage over competing clients. The attacks are named ACK division, DupACK 
spoofing, and Optimistic ACKing and are implemented in a TCP variant the authors 
(jokingly) call "TCP Daytona." 

ACK division operates by producing more than one ACK for the range of
bytes being acknowledged. Because the TCP congestion control typically operates
based on the arrival of ACK packets (rather than the ACK field contained in the
ACK itself), a sending TCP can be induced to increase cwnd faster than it would
otherwise. This problem can be mitigated by basing the congestion control
computations on the amount of data acknowledged rather than the arrival of a packet,
as is done with ABC.

DupACK spoofing causes a sender to increase its congestion window during
fast recovery. Recall from the previous discussion that during standard fast
recovery, cwnd is incremented for each duplicate ACK received. The attack involves
creating extra duplicate ACKs that cause this to happen more quickly than intended.
This attack is more difficult to defend against, because there is no clean way to
map received duplicate ACKs to the segments they acknowledge (a nonce, an
associated value that changes with time, which we discuss in Chapter 18, would solve
this problem). While the Timestamps option relates to this problem, it is an option
and can be disabled on a per-connection basis. The best approach to addressing
this problem appears to be modification of the sender side to limit the amount of
outstanding data during recovery.
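
Returning briefly to ACK division, the difference between counting ACK packets and counting acknowledged bytes can be seen in a small Python sketch of slow start. The function names and the 1000-byte segment size are assumptions for the example, and the byte-counting variant is a simplified rendering of ABC with a per-ACK limit of two segments.

```python
MSS = 1000  # bytes; chosen for easy arithmetic


def grow_per_ack(cwnd, acked_bytes_list):
    """Slow start that adds one MSS for every ACK packet that arrives."""
    for _ in acked_bytes_list:
        cwnd += MSS
    return cwnd


def grow_per_byte(cwnd, acked_bytes_list, limit=2 * MSS):
    """Byte counting: add the number of bytes newly acknowledged (capped)."""
    for acked in acked_bytes_list:
        cwnd += min(acked, limit)
    return cwnd


honest = [1000]             # one ACK covering a full segment
divided = [100] * 10        # the same segment acknowledged in ten pieces

print(grow_per_ack(4000, honest), grow_per_ack(4000, divided))
print(grow_per_byte(4000, honest), grow_per_byte(4000, divided))
```

With per-ACK counting the misbehaving receiver obtains ten times the window growth for the same segment; with byte counting the growth is identical in both cases.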

Optimistic ACKing involves producing ACKs for segments that have not yet 
arrived. Because TCP's congestion control computations are based on end-to-end 
RTTs, ACKing data that has not yet arrived causes the sender to react faster than 
it would because it is fooled into believing the actual RTT is smaller. Furthermore, 
there is little penalty for doing this, as a sender typically ignores ACKs for data it 
has not yet sent. While this approach does not preserve data reliability at the TCP 
layer as the other attacks do (i.e., ACKed data could still be lost), it is frequently 
the case (e.g., in HTTP/1.1) that missing data can be reconstructed by an
application- or session-layer protocol. The authors describe a cumulative nonce that can
address this problem and a way to alter the sizes of sent segments over time to 
better match up ACKs with sent segments. When the ACKs do not correspond, 
the sender can take action. 

The problems described for misbehaving receivers have also received
attention with respect to ECN by some of the same authors. Recall that with AQM using
ECN, the TCP receiver returns the ECN indication to the sender in an ACK. The
sender is then supposed to respond by slowing down. If the receiver fails to return
the ECN indications to the sender (or routers in the network clear the indicators),
the sender would never be informed of congestion and would not slow down. In
[RFC3540], the authors describe an experimental way to use the ECT bit field of
the ECN field (2 bits) of an IP packet as a form of nonce. The sender places a
random binary value in the field, and the receiver returns a 1-bit sum (an XOR
operation) of the values of this field over time. When generating an ACK, the receiver
places the sum in bit 7 of the TCP header (currently reserved as zero). A misbehaving
receiver has a 50/50 chance of guessing the sum. Because each packet represents
an independent trial and a successful misbehaving receiver must have every sum
correct, its chance of doing so is 1/2^k for k packets (vanishingly small for a
connection of any reasonable duration).
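
The arithmetic behind the nonce is easy to reproduce. The Python sketch below computes the 1-bit running sum with XOR and the probability that a receiver forced to guess gets k sums right in a row. The function names are hypothetical, and the seeded random generator is used only so the example is repeatable.

```python
import random
from functools import reduce
from operator import xor


def nonce_sum(nonces):
    """1-bit sum (XOR) of the nonce bits seen so far."""
    return reduce(xor, nonces, 0)


def guess_probability(k):
    """Chance of guessing k independent 1-bit sums correctly."""
    return 0.5 ** k


rng = random.Random(1)                       # seeded only for repeatability
sent = [rng.randint(0, 1) for _ in range(8)]
print("nonces:", sent, "sum:", nonce_sum(sent))
for k in (1, 10, 32):
    print(k, guess_probability(k))
```

A receiver that has actually seen every nonce can always produce the right sum, whereas one that hides a CE-marked packet (whose nonce has been overwritten) must guess from that point on.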


16.13 Summary 

TCP was designed as the primary reliable transport protocol for the Internet.
Although its initial design included a flow control capability, used to cause a
sender to slow down when a receiver could not keep up, no provision was made
initially for preventing the sender from overwhelming the network in between. In
the late 1980s, the slow start and congestion avoidance algorithms were developed
to regulate a TCP sender's aggressiveness so as to avoid losing packets because
of congestion in the network. These algorithms depend on using an implicit
signal, packet loss, as an indicator of congestion. They are triggered when loss is
detected, either by the fast retransmit algorithm or by retransmission timeouts.

Slow start and congestion avoidance regulate a sender's operation by
introducing a congestion window at the sender. This is used in conjunction with the
conventional window (based on window advertisements provided by the receiver). A
standard TCP limits its window to the minimum of the two. Slow start grows the
value of the congestion window exponentially with time, and congestion
avoidance grows it about linearly with time. Only one of the two algorithms is in
operation at any one time, and this decision is made by comparing the current value
of the congestion window to the slow start threshold: if the congestion window
exceeds the threshold, congestion avoidance takes control; otherwise, slow start
is used. Slow start is used initially when TCP establishes its connection and after
restart conditions due to timeouts. It can also be used when a connection has gone
idle for a significant amount of time. The slow start threshold is adjusted
dynamically during the course of the connection.
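
As a compact reminder of how these pieces fit together, the following Python sketch applies the per-ACK rule just summarized, with windows measured in segments. It is deliberately minimal: the function names are hypothetical, and loss handling, byte counting, and the dynamic adjustment of the threshold are left out.

```python
def on_ack(cwnd, ssthresh):
    """Grow cwnd (in segments) in response to one good ACK."""
    if cwnd > ssthresh:
        return cwnd + 1.0 / cwnd     # congestion avoidance: about +1 per RTT
    return cwnd + 1.0                # slow start: doubles every RTT


def usable_window(cwnd, awnd):
    """A standard TCP sends no more than the smaller of the two windows."""
    return min(cwnd, awnd)


cwnd, ssthresh = 1.0, 8.0
for rtt in range(6):
    print(rtt, round(cwnd, 2))
    for _ in range(int(cwnd)):       # roughly one ACK per segment in flight
        cwnd = on_ack(cwnd, ssthresh)
```

Running the loop shows the window doubling each round trip until it passes the threshold, after which it grows by about one segment per round trip.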

Congestion control has been a significant focus of the networking research
community over the years. After more experience was gained with TCP and its
slow start and congestion avoidance procedures, a number of improvements have
been suggested, implemented, and standardized. By keeping track of when TCP
is recovering from a collection of lost packets, the NewReno variant of TCP avoids
some of the stalls that can occur with Reno variants when multiple packets are
dropped in a single window of data. SACK TCP can improve upon NewReno's
behavior by permitting the sender to intelligently repair more than one packet
drop per RTT. With SACK TCP, careful accounting must be established to ensure
that the sender is not overly aggressive with respect to other TCPs with which it
may be sharing an Internet path.

Some of the more recent changes to TCP congestion management include rate
halving, congestion window validation and moderation, and "undo" procedures.
The rate-halving algorithm causes the congestion window to reduce gradually
after detected loss events instead of reducing it immediately. Congestion window
validation tries to ensure that the congestion window is not overly large if a
sending application has been idle or unable to send for some time. Congestion window
moderation limits the size of a burst in response to the receipt of a single ACK. The
"undo" procedures, such as the Eifel Response Algorithm, undo congestion
window modifications if the packet loss signal is deemed to be spurious, a condition
detectable using a number of techniques. In such cases, the negative impact on
performance by reducing the congestion window is minimized by restoring the
congestion state to its condition prior to the reduction of the congestion window.

After significant experience with TCP, it was observed that the congestion
avoidance procedure can take a long time to find and exploit additional
bandwidth that becomes available. As a result, numerous proposals for
"bandwidth-scalable" TCP variants have been made. One of the better-known versions (within
the IETF) is HSTCP, which allows the congestion window to grow much more
aggressively in operating regimes where few packets are dropped and windows
are large, as compared with conventional TCP. Subsequent suggestions have
included FAST and CTCP, which base their window growth procedures on packet
loss and latency measures. Widely deployed in Linux, the BIC-TCP and CUBIC
algorithms use growth functions that are convex in some portions and concave in
others. This supports small window changes near the saturation point, leading
to enhanced stability at the possible cost of somewhat sluggish response to new
available bandwidth (but still faster than standard TCP).

A significant change to the operation of TCP and Internet routers has been 
proposed with the specification of Explicit Congestion Notification (ECN), which 
would allow TCP to detect the onset of congestion before a packet loss is
experienced. Although simulations and research results have shown this to be desirable,
it requires a moderate change to TCP implementations and a significant change to 
the way Internet routers operate. The extent to which this capability is deployed 
remains to be seen. 

Although TCP provides the most widely used method for reliably moving 
data on the Internet, it does not implement much in the way of its own security. It 
is generally vulnerable to packet-forging attacks that can cause disruptions of
connections; an attacker need only have a good guess at a viable (in-window) sequence
number to launch such attacks. In addition, modification of the ACK stream (or 
ECN bits, if they are supported) can induce a sender to behave in ways that are 
unfair to other TCP connections. Furthermore, nothing physically prevents an 
overly aggressive sender from simply violating all congestion control rules. 

Combining all of the various algorithms and techniques developed for TCP 
into a single TCP implementation is not an easy task (Linux 2.6.38 TCP/IPv4 is 
about 20,000 lines of C code), and analyzing traces of a real-world TCP in action 
can be time-consuming. Tools such as tcpdump, Wireshark, and tcptrace make 
this job considerably easier. Because of its dynamic adaptation to the performance 
of the network, understanding TCP's behavior is most easily accomplished with 
visualization techniques based on time-series plots, such as those used in this 
chapter. 


17.1 Introduction 

Many newcomers to TCP/IP are surprised to learn that no data whatsoever flows 
across an idle TCP connection. That is, if neither process at the ends of a TCP 
connection is sending data to the other, nothing is exchanged between the two TCP 
endpoints. There is no polling, for example, as you might find with other 
networking protocols. This means that we can start a client process that establishes a TCP 
connection with a server and walk away for hours, days, weeks, or months, and 
the connection should remain up. In theory, intermediate routers can crash and 
reboot, data lines may go down and back up, but as long as neither host at the 
ends of the connection reboots (or changes its IP address), the connection remains 
established. This is how TCP/IP was designed. 


Note 

The previous statement assumes that neither application (neither the client nor 
the server) has application-level timers to detect inactivity, causing either 
application to terminate. It also assumes that no intermediate router is keeping state 
about the connection (such as a NAT box) that is required for proper operation 
that it might delete because of inactivity or lose because of system failure. In 
today's Internet, these are big assumptions. 


Under some circumstances, it is useful for a client or server to become aware 
of the termination or loss of connection with its peer. In other circumstances, it is 
desirable to keep a minimal amount of data flowing over a connection, even if the 
applications do not have any to exchange. TCP keepalive provides a capability 
useful for both cases. Keepalive is a method for TCP to probe its peer without 
affecting the content of the data stream. It is driven by a keepalive timer. When the timer 
fires, a keepalive probe (keepalive for short) is sent, and the peer receiving the probe 
responds with an ACK. 


Note 

Keepalives are not part of the TCP specification. The Host Requirements RFC 
[RFC1122] says that this is because they could (1) cause perfectly good 
connections to break during transient Internet failures, (2) consume unnecessary 
bandwidth, and (3) cost money for an Internet path that charges for packets. 
Nevertheless, most implementations provide the keepalive capability. 


TCP keepalive is a controversial feature. Many feel that polling of the other 
end has no place in TCP and should be done by the application, if desired. On 
the other hand, if many applications require such functionality it is convenient 
to place it in TCP so that its implementation can be shared. The keepalive is an 
optionally enabled feature that can cause an otherwise good connection between 
two processes to be terminated because of a temporary loss of connectivity in the 
network joining the two end systems. For example, if the keepalive probes are sent 
during the time that an intermediate router has crashed and is rebooting, TCP 
incorrectly thinks its peer host has crashed. 

The keepalive feature was originally intended for server applications that 
might tie up resources on behalf of a client and want to know if the client host 
crashes or goes away. Using TCP keepalive to detect dead clients is most useful for 
servers that expect to have a relatively short-duration dialogue with a 
noninteractive client (e.g., Web servers, POP and IMAP e-mail servers). Servers 
implementing more interactive-style services that last for a long time (e.g., remote login such 
as ssh and Windows Remote Desktop) might wish to avoid using keepalives. 

A common example showing the utility of the keepalive feature nowadays is 
when a user uses the ssh (secure shell) remote login program to log in to a remote 
host through a NAT router. If the user were to establish the connection, do some 
work, then just power off the computer at the end of the day, without logging off, 
a half-open connection would be left. In Chapter 13 we showed that sending data 
across a half-open connection causes a reset to be returned, but that was from the 
server end, where the client was sending the data. If the client disappears, leaving 
the half-open connection on the server's end, and the server is waiting for some 
data from the client, the server will wait forever. The keepalive feature is intended 
to detect these half-open connections from the server side. 

Another reason for using keepalives is somewhat the reverse. If the user does 
not power off the computer but instead leaves a connection open all night (and 
wishes to continue using it the next day), the connection goes idle for many hours. 
In Chapter 7 we discussed how most NAT routers include a timeout mechanism 
that flushes the state of a connection after some period of inactivity. If the NAT 
timeout is less than the several hours before the user returns to use the login 
session, and the NAT is not smart enough to probe the end station to make sure it is 
still active, or the NAT crashes, the connection is terminated. To avoid this common 
problem, ssh can be configured to use TCP keepalives. ssh also has the ability to 
use application-managed keepalives, and the two behave differently, especially with 
respect to their security properties. (Please see Section 17.3 for more on this.) 


17.2 Description 

Either end of a TCP connection may request keepalives, which are turned off by 
default, for their respective direction of the connection. A keepalive can be set 
for one side, both sides, or neither side. There are several configurable 
parameters that control the operation of keepalives. If there is no activity on the 
connection for some period of time (called the keepalive time), the side(s) with keepalive 
enabled sends a keepalive probe to its peer(s). If no response is received, the probe 
is repeated periodically with a period set by the keepalive interval until the number 
of probes sent reaches the value of keepalive probes. If this happens, the peer's 
system is determined to be unreachable and the connection is terminated. 
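
As a minimal sketch of how an application might request keepalives for its 
end of a connection, the following Python fragment sets the standard 
SO_KEEPALIVE socket option. The helper name enable_keepalive is hypothetical, 
and the timing parameters are left at whatever the operating system is 
configured to use. 

```python
import socket


def enable_keepalive(sock: socket.socket) -> bool:
    """Request keepalives for this end of the connection.

    Returns True if the option reads back as enabled.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


# The option can be set before or after the connection is established;
# an unconnected socket is enough to show the call.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enabled = enable_keepalive(sock)
sock.close()
print("keepalive enabled:", enabled)
```

Because the option applies only to the socket on which it is set, the peer 
must make the same request if keepalives are wanted in both directions. 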

A keepalive probe is an empty (or 1-byte) segment with sequence number 
equal to one less than the largest ACK number seen from the peer so far. Because 
this sequence number has already been ACKed by the receiving TCP, the arriving 
segment does no harm, but it elicits an ACK that is used to determine whether the 
connection is still operating. Neither the probe nor its ACK contains any new data 
(it is "garbage" data), and neither is retransmitted by TCP if lost. [RFC1122] dictates 
that because of this fact, the lack of response for a single keepalive probe should 
not be considered sufficient evidence that the connection has stopped operating. 
This is the reason for the keepalive probes parameter setting mentioned previously. 
Note that some (mostly older) TCP implementations do not respond to keepalives 
lacking the "garbage" byte of data. 
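
The choice of sequence number can be illustrated with a short calculation. 
This is only a sketch: the function name keepalive_probe_seq is hypothetical, 
and the modulus reflects the 32-bit TCP sequence space. 

```python
SEQ_MODULUS = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap around


def keepalive_probe_seq(largest_ack_seen: int) -> int:
    """Sequence number for a keepalive probe.

    It is one less than the largest ACK number seen from the peer, so the
    byte it names has already been acknowledged and lies below the peer's
    left window edge.
    """
    return (largest_ack_seen - 1) % SEQ_MODULUS


# With relative sequence numbers, a peer that has ACKed up to 97
# is probed with sequence number 96.
print(keepalive_probe_seq(97))
# The subtraction wraps if the ACK number is 0.
print(keepalive_probe_seq(0))
```

The peer answers with an ACK for the next byte it expects, which is simply 
the largest ACK number again, so the exchange leaves the data stream untouched. 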

Anytime it is operating, a TCP using keepalives may find its peer in one of 
four states: 

1. The peer host is still up and running and reachable. The peer's TCP 
responds normally and the requestor knows that the other end is still up. 
The requestor's TCP resets the keepalive timer for later (equal to the value 
of the keepalive time). If there is application traffic across the connection 
before the next timer expires, the timer is reset back to the value of keepalive 
time. 

2. The peer's host has crashed and is either down or in the process of 
rebooting. In either case, its TCP is not responding. The requestor does not receive 
a response to its probe, and it times out after a time specified by the keepalive 
interval. The requestor sends these probes keepalive interval apart, up to a 
total given by keepalive probes, and if it does not receive a response, the requestor 
considers the peer's host as down and terminates the connection. 

3. The peer's host has crashed and rebooted. In this case, the requestor receives a 
response to its keepalive probe, but the response is a reset segment, causing 
the requestor to terminate the connection. 

4. The peer's host is up and running but is unreachable from the requestor for 
some reason (e.g., the network cannot deliver traffic and may or may not 
inform the peers of this fact using ICMP). This is effectively the same as 
state 2, because TCP cannot distinguish between the two. All TCP can tell 
is that no replies are received to its probes. 

The requestor does not have to worry about the peer's host being shut down 
gracefully and then rebooting (as opposed to crashing). When the system is shut 
down by an operator, all application processes are terminated (i.e., the peer's 
process), which causes the peer's TCP to send a FIN on the connection. Receiving the 
FIN would cause the requestor's TCP to report an end-of-file to the requestor's 
process, allowing the requestor to detect this scenario and exit. 
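
The end-of-file indication can be seen with a small experiment over the 
loopback interface. This is a sketch that assumes 127.0.0.1 is available; 
the function name read_after_peer_close is hypothetical, and closing the 
peer's socket stands in for the peer's process being terminated at shutdown. 

```python
import socket


def read_after_peer_close() -> bytes:
    """Return what the requestor reads after its peer closes gracefully."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the system pick a free port
    listener.listen(1)

    requestor = socket.create_connection(listener.getsockname())
    peer, _ = listener.accept()

    peer.close()  # graceful close: the peer's TCP sends a FIN

    requestor.settimeout(5.0)  # guard so the sketch cannot block forever
    data = requestor.recv(1024)  # an empty result means end-of-file

    requestor.close()
    listener.close()
    return data


result = read_after_peer_close()
print(repr(result))
```

No keepalive is involved here: the FIN alone is enough for the read to 
return, which is why a graceful shutdown does not need to be probed for. 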

In the first state the requestor's application has no idea that keepalive probes 
are taking place (except that it chose to enable keepalives in the first place). 
Everything is handled at the TCP layer. It is transparent to the application until one 
of states 2, 3, or 4 is determined. In these three cases, an error is returned to the 
requestor's application by its TCP. (Normally the requestor has issued a read from 
the network, waiting for data from the peer. If the keepalive feature returns an 
error, it is returned to the requestor as the return value from the read.) In 
scenario 2 the error is something like "Connection timed out," and in scenario 3 we 
expect "Connection reset by peer." The fourth scenario may look as if the 
connection timed out, or may cause another error to be returned, depending on whether 
an ICMP error related to the connection is received and how it is processed (see 
Chapter 8). We look at all four scenarios in the next section. 
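
An application might map the error it receives back to these scenarios 
roughly as follows. This is a sketch only: the function name 
keepalive_state_for is hypothetical, and treating EHOSTUNREACH and 
ENETUNREACH as the "other error" of the fourth scenario is an assumption 
about how a given system reports ICMP errors. 

```python
import errno
import os


def keepalive_state_for(err: int) -> str:
    """Map an error number from a failed read to the scenarios in the text."""
    if err == errno.ETIMEDOUT:
        # TCP cannot tell a crashed peer from an unreachable one.
        return "state 2 or 4: no reply to any probe"
    if err == errno.ECONNRESET:
        return "state 3: peer rebooted and answered with a reset"
    if err in (errno.EHOSTUNREACH, errno.ENETUNREACH):
        # Assumption: an ICMP unreachable error was received and reported.
        return "state 4: ICMP reported the peer unreachable"
    return "not a keepalive-related error"


for code in (errno.ETIMEDOUT, errno.ECONNRESET, errno.EHOSTUNREACH):
    # os.strerror gives the system's own wording for each error.
    print(os.strerror(code), "->", keepalive_state_for(code))
```

The message text printed for each error number varies between operating 
systems, so code should compare error numbers rather than strings. 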

The values of the variables keepalive time, keepalive interval, and keepalive probes 
can usually be changed. Some systems allow these changes on a per-connection 
basis, while others allow them to be set only system-wide (or both in some cases). 
In Linux, these values are available as sysctl variables with the names 
net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl, and 
net.ipv4.tcp_keepalive_probes, respectively. The defaults are 7200 (seconds, or 2 
hours), 75 (seconds), and 9 (probes). 
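
On systems that allow per-connection settings, the three values can also be 
set on an individual socket. The sketch below assumes the Linux socket 
options TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT as exposed by Python's 
socket module; these names are not given in the text above and are not 
available on every platform, so each is applied only if present. The helper 
name set_keepalive_params and the example values are hypothetical. 

```python
import socket


def set_keepalive_params(sock, idle_s, interval_s, probes):
    """Enable keepalive and apply per-connection timing where supported.

    Returns the values read back for each option the platform provides.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    wanted = {
        "TCP_KEEPIDLE": idle_s,       # keepalive time, in seconds
        "TCP_KEEPINTVL": interval_s,  # keepalive interval, in seconds
        "TCP_KEEPCNT": probes,        # keepalive probes
    }
    applied = {}
    for name, value in wanted.items():
        option = getattr(socket, name, None)
        if option is None:
            continue  # this platform does not expose the option
        sock.setsockopt(socket.IPPROTO_TCP, option, value)
        applied[name] = sock.getsockopt(socket.IPPROTO_TCP, option)
    return applied


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
applied = set_keepalive_params(sock, idle_s=120, interval_s=10, probes=5)
sock.close()
print(applied)
```

Per-connection values override the system-wide sysctl settings only for the 
socket on which they are set. 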

In FreeBSD and Mac OS X, the first two values are also available as sysctl 
variables called net.inet.tcp.keepidle and net.inet.tcp.keepintvl, 
with default values 7,200,000 (milliseconds, or 2 hours) and 75,000 (milliseconds, 
or 75s), respectively. These systems also have a Boolean variable called 
net.inet.tcp.always_keepalive. If this value is enabled, all TCP connections have the 
keepalive function enabled, even if the application did not request it. In these 
systems, the number of probes is a fixed default value: 8 (FreeBSD) or 9 (Mac OS X). 

In Windows, these values are available for modification via registry entries 
under the system key: 


HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters 


The value KeepAliveTime defaults to 7,200,000ms (2 hours); KeepAliveInterval 
defaults to 1000ms (1s). If there is no response to ten keepalive probes, 
Windows terminates the connection. 
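
The defaults quoted above imply how long an idle connection to a dead peer 
can persist before it is terminated. The calculation below is a sketch under 
one assumption: that the connection is dropped one keepalive interval after 
the last unanswered probe, so the total is the keepalive time plus the number 
of probes multiplied by the interval. The function name detection_time_s is 
hypothetical. 

```python
def detection_time_s(keepalive_time_s, interval_s, probes):
    """Seconds from the last activity until a dead peer is declared down.

    Assumes termination one interval after the final unanswered probe.
    """
    return keepalive_time_s + probes * interval_s


# (keepalive time, keepalive interval, keepalive probes), all in seconds,
# using the default values quoted in the text.
DEFAULTS = {
    "Linux": (7200, 75, 9),
    "FreeBSD": (7200, 75, 8),
    "Mac OS X": (7200, 75, 9),
    "Windows": (7200, 1, 10),
}

for system, (time_s, interval_s, probes) in DEFAULTS.items():
    total = detection_time_s(time_s, interval_s, probes)
    print(f"{system}: {total} s ({total / 60:.2f} min)")
```

With these defaults the 2-hour keepalive time dominates; the probing phase 
adds between 10 seconds and a little over 11 minutes. 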

Note that [RFC1122] places certain restrictions on the use of keepalives. In 
particular, the keepalive time must be configurable and must not default to less 
than 2 hours. In addition, keepalives must not be enabled unless an application 
requests them (although this behavior is violated if the 
net.inet.tcp.always_keepalive variable is set). Linux does not provide a native facility for adding 
keepalives to applications that do not request it, but a special library can be 
preloaded (i.e., loaded prior to ordinary shared libraries) to get this effect [LKA]. 

17.2.1 Keepalive Examples 

We shall now go through states 2, 3, and 4 from the previous section, to see the 
packets exchanged using the keepalive mechanism. The operation in state 1 will 
be illustrated in the course of looking at the others. 

17.2.1.1 Other End Crashes 

Let us see what happens when the server host crashes and does not reboot. To 
simulate this we will do the following steps: 

1. Using the regedit program on a Windows client, modify the registry key, 
and set KeepAliveTime to 7000ms (7s). This may require the system to be 
rebooted to accept the new value. 

2. Establish an ssh connection between the Windows client and a Linux 
server using an option that enables TCP keepalives. 

3. Verify that data can go across the connection. 

4. Watch the client's TCP send keepalive packets every 7s, and see them 
acknowledged by the server's TCP. 

5. Disconnect the network cable from the server, and leave it disconnected 
until the example is complete. This makes the client think the server host 
has crashed. 

6. We expect the client to send ten keepalive probes, 1s apart, before declaring 
the connection dead. 

Here is the interactive output on the client: 


C:\> ssh -o TCPKeepAlive=yes 10.0.1.1 

(password prompt and login continues) 

Write failed: Connection reset by peer (about 15 seconds after disconnect) 


Figure 17-1 shows the results using Wireshark. In this example, the 
connection has already been established. The Wireshark output begins with a keepalive 
(packet 1) that is not identified as such. At this point, Wireshark has not processed 
enough packets to determine that the one sequence number in packet 1 is below 
the receiver's left window edge and is therefore a keepalive. Packet 2 contains an 
ACK number that allows Wireshark to process the sequence numbers in 
subsequent packets appropriately. 


Figure 17-1 TCP keepalives are generated every 7s after the connection becomes idle. Each contains a below- 
window sequence number that is ACKed by the peer. A cable disconnection after 1 minute causes 
subsequent keepalives to not be ACKed. The client tries ten times before giving up and 
terminating the connection. The termination is signaled to the server by the final reset segment (which the 
server cannot hear). This example also illustrates the use of DSACKs at the server and a spurious 
retransmission caused by the client delaying ACKs. 


Most of this connection consists of keepalives and corresponding ACKs. 
Packets 1, 3, 5, 7, 14, 16, 18, 20, and 22-31 are all keepalives. Packets 2, 4, 6, 8, 15, 17, 
19, and 21 are the corresponding ACKs. Keepalives are sent periodically every 7s 
provided they are ACKed. When no ACK is returned for a keepalive, the sender 
switches to a 1s interval for sending keepalives, according to the default value 
of KeepAliveInterval. This starts with packet 23 at time 62.120. The sender 
produces ten unacknowledged keepalives in total (packets 22-31). After that, it 
terminates the connection, which results in the final reset segment (packet 32) 
that is never received by the disconnected receiver. The user receives the following 
output when the connection terminates: 


Write failed: Connection reset by peer 


This is a clear indication that the connection has terminated, but it is not entirely 
accurate. It was really the sender that terminated the connection, but it did so 
based on the lack of response from the receiver. 

Apart from the use of keepalive segments, there are some other interesting 
features of this connection we will mention briefly. First, the server uses DSACKs 
(see Chapter 14). Each ACK contains the sequence number range of the previously 
received in-window segment. Next, a small bit of data is exchanged at time 26.09. 
The data represents a single key press. It is sent to the server, ACKed by the server, 
and echoed back. The data is encrypted, causing the packets containing data to be 
48 bytes in user data size (see Chapter 18). 

Interestingly, the echoed character is sent twice. We can see that packet 11, 
which contains the echoed character, is not ACKed immediately. Recall from 
Chapter 14 that Linux uses an RTO of at least 200ms. Here we see that the Linux 
server retransmits the echoed character 200ms later, which produces an 
immediate response from the client. Because this test was performed on an uncongested 
LAN, it is highly unlikely that segment 11 was dropped. Instead, it appears that 
Linux produced a spurious retransmission due to the client delaying ACKs. This 
is similar to the hazard we saw when exploring the poor interaction between 
the Nagle algorithm and delayed ACKs, discussed in Chapter 15. Here, the 
dynamic results in an unnecessary delay of about 200ms. 

17.2.1.2 Other End Crashes and Reboots 

In this example we will see what happens when the peer crashes and reboots. The 
initial scenario is the same as the previous one, except this time we set 
KeepAliveTime to 120,000 (2 minutes). We establish a connection and then wait just over 
2 minutes to allow a keepalive message to be sent and ACKed. Then we 
disconnect the server from the network, reboot it, and then reconnect it. We expect the 
next keepalive probe to generate a reset from the server, because the server now 
knows nothing about this connection. Figure 17-2 presents the trace as displayed 
by Wireshark. 

In this example, the connection has been established and small amounts of 
data are exchanged starting at seconds 0.00 and 3.46. Then the connection goes 
idle. After 2 minutes have elapsed (the keepalive time), the client sends the first 
keepalive probe at time 123.47, containing the "garbage" byte below the receiver's 
left window edge. It is acknowledged, and the server is disconnected, rebooted, 
and reconnected. At time 243.47, 120s later, the client sends its second keepalive 
probe. Although this reaches the server, the server no longer has any knowledge 
about the connection and responds with a reset segment (packet 18). This informs 
the client that the connection is no longer active, and the user is provided the same 
"Connection reset by peer" error message we saw before. 


Figure 17-2 The server has rebooted between keepalives sent by the client. The last keepalive elicits a reset 
segment because the server no longer knows anything about the connection. 


17.2.1.3 Other End Is Unreachable 

In this case, the server has not crashed but becomes unreachable during the 
interval when the keepalive probes are sent. An intermediate router may have crashed, 
a phone line may be temporarily out of order, or something similar. To simulate 
this example we will use our sock program with the keepalive option set to 
establish a connection to a remote server. We will use a Mac OS X client and an LDAP 
server (port 389) running on ldap.mit.edu. After shortening the client's keepalive 
time (for convenience) and opening the connection, we disconnect the network to 
see the effects. Here are the command lines and output at the client: 


Mac# sysctl -w net.inet.tcp.keepidle=75000 
Mac% sock -K ldap.mit.edu 389 

recv error: Operation timed out (about 14 minutes later) 


The trace is displayed using Wireshark (see Figure 17-3). 


Figure 17-3 The WAN connection is taken down after the first keepalive probe is acknowledged. Another 
probe is sent every 75s. After nine keepalives are sent without a response, the connection is 
terminated and the client sends a reset to its peer. For the client, the situation is very similar to when 
the server crashes, as illustrated in Figure 17-1. 


In this figure we can see the entire connection. After the initial three-way 
handshake, the connection remains idle and a keepalive is sent and acknowledged 
at about time 75 (packet 4). This first keepalive is triggered by the value of the 
net.inet.tcp.keepidle variable. Shortly thereafter, the network is severed. Neither 
end of the connection produces data, so the next event is another keepalive sent by 
the client at time 150 (75s later, the value of the net.inet.tcp.keepintvl 
variable). This pattern repeats with packets 7-14, with no ACKs present, even though 
the server is up and running. Finally, the client gives up 75s after its ninth 
unacknowledged keepalive probe. The connection termination is indicated to the server 
by a reset segment at the end (packet 15). Of course, the server is unable to receive 
this packet because the network is not operating. 
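
The timing of this example can be reproduced with a short calculation. It is 
a sketch that takes the acknowledged keepalive at about time 75 as the last 
activity and uses nine probes, as stated in the caption of Figure 17-3. In 
this example the keepalive time and the keepalive interval are both 75s, so 
the schedule is the same whichever of the two governs the wait before the 
first unanswered probe. The function name probe_schedule is hypothetical. 

```python
def probe_schedule(last_activity_s, keepalive_time_s, interval_s, probes):
    """Return (send times of the unanswered probes, time the sender gives up)."""
    first_probe = last_activity_s + keepalive_time_s
    sends = [first_probe + i * interval_s for i in range(probes)]
    # The sender waits one more interval after the final probe.
    give_up = sends[-1] + interval_s
    return sends, give_up


sends, give_up = probe_schedule(
    last_activity_s=75, keepalive_time_s=75, interval_s=75, probes=9
)
print("probes sent at:", sends)
print("connection terminated at:", give_up, "s =", give_up / 60, "min")
```

The result, a little under 14 minutes from the start of the connection, 
agrees with the "about 14 minutes later" noted beside the error message in 
the client output. 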

When a client TCP using keepalives is unable to communicate across the 
network with its peer, as this example shows, it retries some number of times 
before giving up. This is essentially the same behavior we saw when the other 
end crashed. In most cases, the sending TCP cannot tell the difference. There are 
some exceptions, such as when ICMP indicates that the destination has become 
unreachable or otherwise unavailable because of problems in the network, but 
these conditions are relatively rare because ICMP is often blocked. As a result, 
mechanisms such as TCP keepalive (or similar mechanisms implemented by 
applications) are used to detect disconnection periods. 


17.3 Attacks Involving TCP Keepalives 

As we mentioned before, ssh (version 2) has an application-level form of 
keepalive called server alive messages and client alive messages. These are different from 
TCP keepalive messages because they are sent over an encrypted channel at the 
application layer and contain data. TCP keepalives contain no user-level data, so 
the use of encryption is limited at best. The consequence is that TCP keepalives 
may be spoofed. When TCP keepalives are spoofed, the victim can be coerced into 
keeping resources allocated for a period longer than intended. 

Although it may be a relatively minor concern, TCP keepalives are driven off a 
timer based on the various configuration parameters discussed earlier, and not off 
the dynamically adjusted retransmission timer used to retransmit segments with 
data. A passive observer could notice the existence of keepalives and their 
interarrival times to conceivably learn information about the configuration parameters 
(possibly identifying the type of sending system, called fingerprinting) or about the 
network topology (i.e., whether downstream routers are forwarding traffic or not). 
These issues could be of concern in some environments. 
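
To make the fingerprinting concern concrete, the sketch below compares the 
typical gap between observed unanswered probes with the default keepalive 
intervals listed in Section 17.2. The arrival times are made-up illustrative 
values, not taken from any trace, and the function name candidate_systems 
and the 0.5s tolerance are hypothetical. Real systems may be configured with 
non-default values, so a match is only a hint. 

```python
from statistics import median

# Default keepalive intervals in seconds, as listed in Section 17.2.
DEFAULT_INTERVALS_S = {"Linux": 75, "FreeBSD": 75, "Mac OS X": 75, "Windows": 1}


def candidate_systems(arrival_times_s, tolerance_s=0.5):
    """Systems whose default interval matches the typical gap between probes."""
    gaps = [later - earlier
            for earlier, later in zip(arrival_times_s, arrival_times_s[1:])]
    if not gaps:
        return []  # a single probe reveals no interval
    typical_gap = median(gaps)
    return sorted(name for name, interval in DEFAULT_INTERVALS_S.items()
                  if abs(interval - typical_gap) <= tolerance_s)


# Hypothetical arrival times (seconds) of unanswered probes seen on the wire.
print(candidate_systems([150.0, 225.1, 299.9, 375.0]))
print(candidate_systems([62.1, 63.1, 64.1]))
```

Because several systems share the same default interval, the interval alone 
narrows the possibilities rather than identifying a single system. 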


17.4 Summary 

As we said earlier, the keepalive feature has been somewhat controversial. 
Protocol experts continue to debate whether it belongs in the transport layer or should 
be handled entirely by the application. All popular TCP implementations now 
include the keepalive feature, which applications may optionally use to establish 
a "heartbeat" of traffic moving across a connection. Doing so can help a server by 
allowing it to detect nonresponsive clients and can help clients by keeping 
connections active (e.g., to keep NAT state active) even if no application-layer data is 
flowing. 

Keepalives operate by sending a probe packet (usually containing a "garbage" 
byte, although zero-length probes are also possible) across a connection after the 
connection has been idle for some relatively long period of time, often 2 hours. 
Four different scenarios can occur: the other end is still there, the other end has 
crashed, the other end has crashed and rebooted, or the other end is currently 
unreachable. We saw each of these scenarios with an example. 

In the first two keepalive examples that we examined, had keepalives not 
been used, and without any application-level timer or activity, TCP would never 
have known that the other end had crashed (or crashed and rebooted). In the final 
example, however, nothing was wrong with the other end; the connection was 
temporarily down. We must be aware of this limitation when using keepalives 
and consider whether or not such behavior is desired. 

Attacks against the keepalive mechanism include causing a system to keep 
resources allocated longer than intended and possibly learning some otherwise 
hidden information about the end systems (although such information may be 
of limited use to an attacker). In addition, by default TCP does not use its own 
encryption, so keepalives and keepalive ACKs can be spoofed, whereas 
application-level keepalives that employ encryption (e.g., ssh) cannot. 


Security: EAP, IPsec, TLS, DNSSEC, and DKIM 


18.1 Introduction 

In this chapter we will take a look at several forms of security used with TCP/IP. 
Security is a very broad and interesting topic, and covering it comprehensively is 
far beyond the scope of this book. Consequently, we will be interested to know 
about the various types of security threats on the Internet, and we will delve into 
some detail on those security mechanisms aimed at countering them that are 
applicable to the operation of various protocols such as IP, TCP, and the important 
e-mail and DNS application protocols. 

Alfhough our par f if toning is nof really formal, securify fhreafs can be bro¬ 
ken down info affacks fhaf fargef implemenfafion problems by frying fo subverf 
processes info running code fhaf was nof infended, frying fo gef users fo run pro¬ 
grams fhaf do bad fhings, and using nefwork protocols in complianf buf unau- 
fhorized ways. We have already seen forms of fhese affacks in ofher chapters. For 
example, one of fhe earliesf worms (self-propagafing soffware) on fhe Infernef used 
a buffer overflow fhaf overwrifes fhe server process's memory. Doing so allows a cli- 
enf program fo injecf soffware info a server fhaf ulfimafely runs fhis injecfed code. 
The injecfed code fhen performs fhe same acfion, fhereby causing fhe program fo 
self-propagafe. Nafurally, such code could perform more malicious acfivifies fhan 
simply self-propagafion. 

The various types of attacks and techniques can be combined, and complicated
software and security analysis tools have been developed as the value of the
information on the Internet has increased. A variety of texts, including [MSK09],
discuss the tools and techniques in more detail. Today, essentially any software
executed by a user or as a user against the user's intentions is known by the
general term malware, short for "malicious software." Entire industries have been
developed to both create and reduce the effects of malware. Malware can be
delivered in e-mail messages or attachments (e.g., in spam), picked up while visiting
a Web site (drive-by attacks), or acquired when using portable media such as
portable USB drives.

In some cases, malware is used to take control of a large number of computers
in the Internet (botnets). Botnets are controlled by individuals or organizations (bot
herders) and can be used on a wide scale for a number of purposes such as
sending spam, compromising other computers, exfiltrating information from the
compromised system (e.g., credit card and bank account information, and the user's
logged keystrokes), and launching DoS attacks by sending a large aggregate
volume of Internet traffic to one or more victims. Botnets are now commonly offered
as a service on a rental basis: a client can hire a bot herder to perform one or more
nefarious tasks. One common task is to generate e-mails in hopes of inducing the
recipient(s) to visit a particular Web site or purchase a particular product
(phishing). When a specific victim is targeted in this way, the activity is usually called
spear phishing.

Our interest is in understanding how secure communication protocols on the
Internet work. Ironically, perhaps, many worms or viruses implement secure
communication protocols. In most cases, we will see how the types of protocols we
have already studied such as IP, TCP, e-mail, and DNS have been augmented with
security extensions (sometimes in the form of additional protocols) to enhance
security. We need to be somewhat precise in defining what "security" means in
terms of a communication protocol, in order to understand if the techniques
available to us are sufficient to provide our desired level of protection. Therefore, we
shall begin by studying the properties of information protection considered
desirable in the field of information security.


18.2 Basic Principles of Information Security 

There are three primary properties of information, whether in a computer
network or not, that may be desirable from an information security point of view:
confidentiality, integrity, and availability (the CIA triad) [L01], summarized here:

• Confidentiality means that information is made known only to its intended 
users (which could include processing systems). 

• Integrity means that information has not been modified in an unauthorized
way before it is delivered.

• Availability means that information is available when needed. 

These are core properties of information, yet there are other properties we
may also desire, including authentication, nonrepudiation, and auditability.
Authentication means that a particular identified party or principal is not impersonating
another principal. Nonrepudiation means that if some action is performed by a
principal (e.g., agreeing to the terms of a contract), this fact can be proven later (i.e.,
cannot successfully be denied). Auditability means that some sort of trustworthy
log or accounting describing how information has been used is available. Such
logs are often important for forensic (i.e., legal and prosecutorial) purposes.

These principles are applicable to information in physical (e.g., printed) form, 
for which mechanisms such as safes, secured facilities, and guards have been used 
for thousands of years to enforce controlled sharing, storage, and dissemination. 
When information is to be moved through an unsecured environment, additional 
techniques are required. To see why, let us examine the types of threats to which 
information can be exposed when it travels through an unsecured communication
channel.


18.3 Threats to Network Communication 

When considering the design and operation of network protocols, ensuring that 
information has the desired properties of integrity, availability, and confidentiality 
can be quite a challenge because of the wide range of possible attacks that can be 
carried out in an otherwise uncontrolled network such as the Internet. Attacks can 
generally be categorized as either passive or active [VK83]. Identifying the category 
is useful because different techniques are required to provide security depending 
on the particular category. Passive attacks are mounted by monitoring or
eavesdropping on the contents of network traffic, and if not handled they can lead to
unauthorized release of information (loss of confidentiality). Active attacks can
cause modification of information (with possible loss of integrity) or denial of
service (loss of availability). Logically, such attacks are carried out by an "intruder" or
adversary. This is often depicted using the scenario shown in Figure 18-1.






Figure 18-1 The principals, Alice and Bob, attempt to communicate securely, but Eve may eavesdrop
and Mallory may modify messages in transit.

The figure depicts the principals, Alice and Bob, trying to communicate.
However, there are two attackers, Eve and Mallory. Eve (eavesdropper) is only able to
monitor the traffic exchanged between Alice and Bob and thus can carry out only
passive attacks. Mallory (malicious attacker) can store, modify, and replay traffic
passing between Alice and Bob, so she can carry out active and passive attacks.
Table 18-1 summarizes the major categories of passive and active attacks that Alice
and Bob may face.


Table 18-1 Attacks on communication are broadly classified as passive or active. Passive attacks are ordinarily
more difficult to detect, and active attacks are ordinarily more difficult to prevent.

            Passive                                      Active
Type              Threats            Type                          Threats
----------------  ---------------    ----------------------------  -----------------------
Eavesdropping     Confidentiality    Message stream modification   Authenticity, integrity
Traffic analysis  Confidentiality    Denial of service (DoS)       Availability
                                     Spurious association          Authenticity


From an attacker's perspective, Table 18-1 gives a quick summary of the
passive attacks available to Eve and the active (and passive) attacks available to
Mallory. Eve is able to eavesdrop (listen in on, also called capture or sniff) and perform
traffic analysis on the traffic passing between Alice and Bob. Capturing the traffic
could lead to compromise of confidentiality, as sensitive data may be available to
Eve without Alice or Bob knowing. In addition, traffic analysis can determine the
features of the traffic, such as its size and when it is sent, and possibly identify the
parties to a communication. This information, although it does not reveal the exact
contents of the communication, could also lead to disclosure of sensitive
information and could be used to mount more powerful active attacks in the future.

While the passive attacks are essentially impossible for Alice or Bob to detect,
Mallory is capable of performing more easily noticed active attacks. These include
message stream modification (MSM), denial-of-service (DoS), and spurious
association attacks. MSM attacks (including so-called man-in-the-middle or MITM
attacks) are a broad category and include any way traffic is modified in transit,
including deletion, reordering, and content modification. DoS might include
deletion of traffic, or generation of such large volumes of traffic as to overwhelm
Alice, Bob, or the communication channel connecting them. Spurious associations
include masquerading (Mallory pretends to be Bob or Alice) and replay, whereby
Alice or Bob's earlier (authentic) communications are replayed later, from
Mallory's memory.

Two major methods are available to prevent the passive and active attacks
we have just described. One method would be to ensure through physical
security that only trusted parties have access to the communication infrastructure
connecting Alice and Bob. This approach is used in limited circumstances but is
effectively impractical for any network spanning a large geographical distance. Of
course, if the communication channel is wireless, securing it using only physical
methods is effectively impossible. Given these considerations, some mechanism is
needed to allow information to pass through unsecured communication channels
in such a way that adversaries like Eve and Mallory are, for the most part at least,
thwarted. This mechanism is cryptography. With effective and careful use of
cryptography, passive attacks are rendered ineffective, and active attacks are made
detectable (and to some degree preventable).


18.4 Basic Cryptography and Security Mechanisms 

Cryptography evolved from the desire to protect the confidentiality, integrity, and
authenticity of information carried through unsecured communication channels.
Such a capability is clearly of significant importance in protecting confidential
information such as military orders, intelligence, and recipes for creating
especially dangerous or valuable materials. The use of cryptography, at least in a
primitive form, dates back to at least 3500 BCE. The earliest systems were usually codes.
Codes involve substitutions of groups of words, phrases, or sentences with groups
of numbers or letters as given in a codebook. Codebooks needed to be kept secret
in order to keep communications private, so distributing them required
considerable care.

More advanced systems used ciphers, in which both substitution and
rearrangement are used. Several codes were used in the Middle Ages, and by the
late 1800s large code and cipher systems were commonly used for diplomatic and
military communications. By the early twentieth century, cryptography was well
established but would not take its major leap forward until World War II.
During this period, electromechanical cryptographic machines such as the German
ENIGMA and Lorenz machines posed a challenge to Allied cryptanalysts (code
breakers). One of the first digital computers, Colossus, was developed by the
British to decipher Lorenz-enciphered messages. A functioning Colossus Mark
2 machine was re-created in 2007, after a 14-year effort, by Tony Sale of the National
Museum of Computing at Bletchley Park, UK [TNMOC].

18.4.1 Cryptosystems 

While the historical basis for cryptography is primarily for preserving
confidentiality, other desirable properties such as integrity and authentication can also
be achieved using cryptographic and related mathematical techniques. To help
understand the basics, Figure 18-2 illustrates how the two most important types of
cryptographic algorithms, called symmetric key and public (asymmetric) key ciphers,
work.

[Figure 18-2 panels: Symmetric Key Cryptosystem; Asymmetric (Public Key) Cryptosystem]

Figure 18-2 The unencrypted (cleartext) message is passed through an encryption algorithm to
produce an encrypted (ciphertext) message. In a symmetric cryptosystem, the same
(secret) key is used for encryption and decryption. In an asymmetric or public key
cryptosystem, confidentiality is achieved by using the recipient's public key for
encryption and private (secret) key for decryption.


This figure shows the high-level operation of symmetric and asymmetric key
cryptography. In each case, a cleartext message is processed by an encryption
algorithm to produce ciphertext (scrambled text). The key is a particular sequence of
bits used to drive the encryption algorithm or cipher. With different keys, the same
input produces different outputs. Combining the algorithms with supporting
protocols and operating methods forms a cryptosystem. In a symmetric cryptosystem,
the encryption and decryption keys are typically identical, as are the encryption
and decryption algorithms. In an asymmetric cryptosystem, each principal is
generally provided with a pair of keys consisting of one public and one private key.
The public key is intended to be known to any party that might want to send a 
message to the key pair's owner. The public and private keys are mathematically 
related and are themselves outputs of a key generation algorithm. One of the major 
benefits of asymmetric key cryptosystems is that secret key material does not have 
to be securely distributed to every party that wishes to communicate.

Without knowing the symmetric key (in a symmetric cryptosystem) or the
private key (in a public key cryptosystem), it is (believed to be) effectively
impossible for any third party that intercepts the ciphertext to produce the corresponding
cleartext. This provides the basis for confidentiality. For the symmetric key
cryptosystem, it also provides a degree of authentication, because only a party holding
the key is able to produce a useful ciphertext that can be decrypted to something
sensible. A receiver can decrypt the ciphertext, look for a portion of the resulting
cleartext to contain a particular agreed-upon value, and conclude that the sender
holds the appropriate key and is therefore authentic. Furthermore, most
encryption algorithms work in such a way that if messages are modified in transit, they
are unable to produce useful cleartext upon decryption. Thus, symmetric
cryptosystems provide a measure of both authentication and integrity protection for
messages, but this approach alone is weak. Instead, special forms of checksums
are usually coupled with symmetric cryptography to ensure integrity. We discuss
these later, after the cryptographic preliminaries.

A symmetric encryption algorithm is usually classified as either a block cipher
or a stream cipher. Block ciphers perform operations on a fixed number of bits (e.g.,
64 or 128) at a time, and stream ciphers operate continuously on however many bits
(or bytes) are provided as input. For years, the most popular symmetric encryption
algorithm was the Data Encryption Standard (DES), a block cipher that uses 64-bit
blocks and 56-bit keys. Eventually, the use of 56-bit keys was felt to be insecure,
and many applications turned to triple-DES (also denoted 3DES or TDES), which
applies DES three times with two or three different keys to each block of data. Today,
DES and 3DES have been largely phased out in favor of the Advanced Encryption
Standard (AES) [FIPS197], also known occasionally by its original name, the
Rijndael algorithm (pronounced "rain-dahl"), in deference to its Belgian cryptographer
inventors Vincent Rijmen and Joan Daemen. Different variants of AES provide key
lengths of 128, 192, and 256 bits and are usually written with the corresponding
extension (i.e., AES-128, AES-192, and AES-256).

Asymmetric cryptosystems have some additional interesting properties
beyond those of symmetric key cryptosystems. Assuming we have Alice as sender
and Bob as intended recipient, any third party is assumed to know Bob's public
key and can therefore send him a secret message; only Bob is able to decrypt it
because only Bob knows the private key corresponding to his public key.
However, Bob has no real assurance that the message is authentic, because any party
can create a message and send it to Bob, encrypted in Bob's public key. Fortunately,
public key cryptosystems also provide another function when used in reverse:
authentication of the sender. In this case, Alice can encrypt a message using her
private key and send it to Bob (or anyone else). Using Alice's public key (known to
all), anyone can verify that the message was authored by Alice and has not been
modified. However, it is not confidential because everyone has access to Alice's
public key. To achieve authenticity, integrity, and confidentiality, Alice can encrypt
a message using her private key and encrypt the result using Bob's public key. The
result is a message that is reliably authored by Alice and is also confidential to
Bob. This process is illustrated in Figure 18-3.

[Figure 18-3 panel: Asymmetric (Public Key) Cryptosystem]

Figure 18-3 The asymmetric cryptosystem can be used for confidentiality (encryption),
authentication (digital signatures or signing), or both. When used for both, it produces a signed
output that is confidential to the sender and the receiver. Public keys, as their name
suggests, are not kept secret.


When public key cryptography is used in "reverse" like this, it provides a
digital signature. Digital signatures are important consequences of public key
cryptography and can be used to help ensure authenticity and nonrepudiation. Only a
party possessing Alice's private key is able to author messages or carry out
transactions as Alice.

In a hybrid cryptosystem, elements of both public key and symmetric key 
cryptography are used. Most often, public key operations are used to exchange a 
randomly generated confidential (symmetric) session key, which is used to encrypt 
traffic for a single transaction using a symmetric algorithm. The reason for doing 
so is performance—symmetric key operations are less computationally intensive 
than public key operations. Most systems today are of the hybrid type: public key
cryptography is used to establish keys used for symmetric encryption of
individual sessions.


18.4.2 Rivest, Shamir, and Adleman (RSA) Public Key Cryptography 

We have seen how public key cryptography can be used for both digital signatures
and confidentiality. The most common approach is called RSA in deference to its
authors' names, Rivest, Shamir, and Adleman [RSA78]. The security of this
system hinges on the difficulty of factoring large numbers into constituent primes.
To initialize RSA, two large prime numbers $p$ and $q$ are generated, which usually
involves checking a number of large odd numbers that are randomly generated
until two primes are found. The product of these primes, $n = pq$, is called the
modulus. The length of $n$, $p$, and $q$ is usually measured in bits, with $n$ often being 1024
bits and the others being about 512, although larger sizes such as 2048 are now
recommended. The value $\phi(v)$ is known in number theory as the Euler totient of the
integer $v$. It gives the number of positive integers less than $v$ that are also coprime
to $v$ (i.e., whose greatest common divisor with $v$ is 1). Because of the way $n$ is
constructed for RSA, $\phi(n) = (q - 1)(p - 1)$.

Using the definition for $\phi(n)$, we can choose the RSA public exponent (called
e for "encryption") and derive a private exponent (called d for "decryption") as
multiplicative inverses using the relation $d = e^{-1} \pmod{\phi(n)}$. In practice, e is often
some value with a fairly small population count (i.e., has a small number of 1 bits)
such as 65,537 (10000000000000001 binary), for faster computations. To form an
encrypted ciphertext c from a cleartext message m, the value $c = m^e \pmod n$ is
computed. To form the value m from c, decryption is performed: $m = c^d \pmod n$. An
RSA public key consists of the public exponent e and modulus n. The
corresponding private key consists of the private exponent d and the modulus n.
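
The key setup and the encryption and decryption relations above can be tried with deliberately tiny numbers. The following is only a sketch: the primes 61 and 53, the exponent 17, and the message value 65 are assumptions chosen for readability, and real deployments use moduli of 2048 bits or more together with a padding scheme that is not shown here.

```python
from math import gcd

# Toy parameters, assumed for illustration only (far too small to be secure).
p, q = 61, 53
n = p * q                    # modulus n = pq
phi = (p - 1) * (q - 1)      # Euler totient of n
e = 17                       # public exponent; must be coprime to phi
assert gcd(e, phi) == 1

# Private exponent: multiplicative inverse of e modulo phi (Python 3.8+).
d = pow(e, -1, phi)

m = 65                       # cleartext as an integer, 0 <= m < n
c = pow(m, e, n)             # encryption: c = m^e mod n
recovered = pow(c, d, n)     # decryption: m = c^d mod n

print(n, phi, d)             # 3233 3120 2753
print(c, recovered)          # 2790 65
```

The three-argument form of pow performs modular exponentiation directly, which is what keeps the computation practical even when the operands are thousands of bits long.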

As suggested earlier, public key algorithms such as RSA can also be used to
produce digital signatures by essentially running RSA "in reverse." To create an
RSA signature of a message m, the value $s = m^d \pmod n$ can be produced as a
signed version of m. Anyone receiving the value s can apply the public exponent
e to produce $m = s^e \pmod n$, which provides the basis for verifying that whatever
produced the value s was in possession of the private value d (otherwise the value
m produced would not be sensible).
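
Running RSA "in reverse" can be sketched with the same kind of toy numbers. The key values below (primes 61 and 53, exponent 17) and the helper names sign and verify are assumptions for illustration; practical signature schemes sign a padded cryptographic hash of the message rather than the raw message integer.

```python
# Toy RSA key, assumed for illustration only (far too small to be secure).
p, q = 61, 53
n = p * q
phi = (p - 1) * (q - 1)
e = 17                       # public exponent
d = pow(e, -1, phi)          # private exponent (Python 3.8+)

def sign(m: int) -> int:
    """Produce s = m^d mod n using the private exponent."""
    return pow(m, d, n)

def verify(m: int, s: int) -> bool:
    """Check that s^e mod n reproduces m using only the public key (e, n)."""
    return pow(s, e, n) == m

m = 1234                     # message as an integer, 0 <= m < n
s = sign(m)

print(verify(m, s))          # True: the signature matches the message
print(verify(m + 1, s))      # False: the message was altered
```

Verification needs only the public values e and n, so any party can check a signature without being able to create one.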

The security of RSA is based on the difficulty of factoring large numbers. In
the context of RSA and our scenario of Figure 18-1, Eve is able to obtain n and e but
does not know p, q, or $\phi(n)$. If she could determine any of these last three values,
it would be trivial to determine d using the relation we have described. However,
doing so appears to involve factoring n, and factoring numbers of 1000 or more bits
is currently believed to be out of reach for even the best factorization algorithms.
Indeed, factoring semiprimes (numbers that are a product of two primes) appears
to represent the most difficult case for such algorithms.

18.4.3 Diffie-Hellman-Merkle Key Agreement (aka Diffie-Hellman or DH) 

A common requirement in security protocols is to have two parties agree on a
common set of secret bits that can be used as a symmetric key. Doing so in a
network that may contain eavesdroppers (such as Eve) is a challenge, because it is not
immediately obvious how to have two principals (such as Alice and Bob) agree
on a common secret number without Eve knowing. The Diffie-Hellman-Merkle Key
Agreement protocol (more commonly called simply Diffie-Hellman or DH) provides
a method for accomplishing this task, based on the use of finite field arithmetic
[DH76].* DH techniques are used in many of the Internet-related security
protocols [RFC2631] and are closely related to the RSA approach for public key
cryptography. We shall have a brief look at how they work.

With the same cast of characters (Alice, Bob, etc.), let us assume that all parties
are aware of two integers $p$ and $g$. Let $p$ be a (large) prime number and $g < p$ be a
primitive root mod $p$. With these assumptions, every integer in the group
$Z_p^* = \{1, \ldots, p - 1\}$ can be generated by raising $g$ to some power. Said another way, for every
$n$ in the group, there exists some $k$ for which $g^k = n \pmod p$. Finding the value (or values) of $k$
given $g$, $n$, and $p$ (called the discrete log problem) is considered to be difficult,
resulting in the belief that DH is secure. Finding the value of $n$ given $g$, $k$, and $p$ is easy,
resulting in the approach being practical.

For Alice and Bob to establish a shared secret key, they can use the following
protocol: Alice chooses a secret random number $a$ and computes $A = g^a \pmod p$,
which she sends to Bob. Bob chooses a secret random number $b$ and computes
$B = g^b \pmod p$, which he sends to Alice. Alice and Bob arrive at the same shared secret
$K = g^{ab} \pmod p$. Alice computes this value this way:

$K = B^a \pmod p = g^{ba} \pmod p$

and Bob computes it this way:

$K = A^b \pmod p = g^{ab} \pmod p$

Given that $g^{ba}$ is equal to $g^{ab}$ (because $Z_p^*$ is so-called power associative and we
assumed all parties are aware of the group being used), both Alice and Bob
know $K$. Note that Eve has access only to $g$, $p$, $A$, and $B$ so cannot determine $K$
without solving the discrete log problem [MW99]. However, this basic protocol
is vulnerable to an attack from Mallory. Mallory can pretend to be Bob when
communicating with Alice and vice versa by supplying her own A and B values.
However, the basic DH protocol can be extended to protect from this
man-in-the-middle attack if the public values for A and B are authenticated [DOW92]. The
classic approach, called the Station-to-Station protocol (STS), involves Alice and Bob
signing their public values.
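
The basic (unauthenticated) exchange can be sketched as follows. The values p = 23 and g = 5 are assumptions chosen so the arithmetic is easy to follow; real deployments use primes of 2048 bits or more, and this sketch does nothing to stop the Mallory attack described above.

```python
import secrets

# Toy public parameters, assumed for illustration only.
p = 23   # small prime modulus
g = 5    # a primitive root mod 23

# Each side picks a private exponent in the range 1 .. p-2.
a = secrets.randbelow(p - 2) + 1   # Alice's secret
b = secrets.randbelow(p - 2) + 1   # Bob's secret

A = pow(g, a, p)   # Alice sends A = g^a mod p to Bob
B = pow(g, b, p)   # Bob sends B = g^b mod p to Alice

K_alice = pow(B, a, p)   # Alice computes B^a mod p
K_bob = pow(A, b, p)     # Bob computes A^b mod p

print(K_alice == K_bob)  # True: both sides hold the same K = g^(ab) mod p
```

Eve sees only p, g, A, and B on the wire; recovering a or b from them is the discrete log problem, which is trivial for a 23-element example but believed infeasible at realistic sizes.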

18.4.4 Signcryption and Elliptic Curve Cryptography (ECC) 

When using RSA, additional security is provided with larger numbers. However,
the basic mathematical operations required by RSA (e.g., exponentiation) can be
computationally intensive and scale as the numbers grow. Reducing the effort of
combining digital signatures and encryption for confidentiality, a class of
signcryption schemes [Z97] (also called authenticated encryption) provides both features
at a cost less than the sum of the two if computed separately. However, even
greater efficiency can sometimes be achieved by changing the mathematical basis
for public key cryptography.

1. The technique was described in a then-classified reference in 1973 by C. Cocks, "A Note on
'Non-Secret Encryption.'" See http://www.cesg.gov.uk/publications/media/notense.pdf.

In a continuing search for security with greater efficiency and performance,
researchers have explored other public key cryptosystems beyond RSA. An
alternative based on the difficulty of finding the discrete logarithm of an elliptic curve
element has emerged, known as elliptic curve cryptography (ECC, not to be
confused with error-correcting code) [M85][K87][RFC5753]. For equivalent security,
ECC offers the benefit of using keys that are considerably smaller than those of
RSA (e.g., by about a factor of 6 for a 1024-bit RSA modulus). This leads to
simpler and faster implementations, issues of considerable practical concern. ECC has
been standardized for use in many of the applications where RSA still retains
dominance, but adoption has remained somewhat sluggish because of patents on
ECC technology held by the Certicom Corporation. (The RSA algorithm was also
patented, but patent protection lapsed in the year 2000.)

18.4.5 Key Derivation and Perfect Forward Secrecy (PFS) 

In communication scenarios where multiple messages are to be exchanged, it is
common to establish a short-term session key to perform symmetric encryption.
The session key is ordinarily a random number (see the following section)
generated by a function called a key derivation function (KDF), based on some input such
as a master key or a previous session key. If a session key is compromised, any of
the data encrypted with the key is subject to compromise. However, it is common
practice to change keys (rekey) multiple times during an extended communication
session. A scheme in which the compromise of one session key does not compromise
future communications is said to have perfect forward secrecy (PFS). Usually, schemes
that provide PFS require additional key exchanges or verifications that introduce
overhead. One example is the STS protocol for DH mentioned earlier.
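
As a rough illustration of deriving per-session keys from a master key, the sketch below uses HMAC-SHA256 over a label and a session counter. The function name derive_session_key, the label string, and the choice of HMAC-SHA256 are all assumptions made for this example; it is not a standardized KDF. Note also that this construction by itself does not provide PFS: anyone who learns the master key can recompute every session key.

```python
import hashlib
import hmac
import secrets

def derive_session_key(master_key: bytes, session_id: int) -> bytes:
    """Derive a 32-byte session key from a master key and a session counter.

    Hypothetical construction for illustration: HMAC-SHA256 keyed with the
    master key over a fixed label plus the counter.
    """
    info = b"session-key" + session_id.to_bytes(8, "big")
    return hmac.new(master_key, info, hashlib.sha256).digest()

master = secrets.token_bytes(32)      # long-term secret shared by both ends

k1 = derive_session_key(master, 1)    # key for the first session
k2 = derive_session_key(master, 2)    # key after rekeying

print(len(k1), k1 != k2)              # 32 True
```

Learning k1 does not directly reveal k2, because recovering the master key from an HMAC output is believed to be infeasible. Protection against a later compromise of the master key requires fresh key agreement for each session, as in the DH-based schemes mentioned above.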

18.4.6 Pseudorandom Numbers, Generators, and Function Families 

In cryptography, random numbers are often used as initial input values to
cryptographic functions, or for generating keys that are difficult to guess. Given that
computers are not very random by nature, obtaining true random numbers is
somewhat difficult. The numbers used in most computers for simulating
randomness are called pseudorandom numbers. Such numbers are not usually truly random
but instead exhibit a number of statistical properties that suggest that they are
(e.g., when many of them are generated, they tend to be uniformly distributed
across some range).

Pseudorandom numbers are produced by an algorithm or device known as a
pseudorandom number generator (PRNG) or pseudorandom generator (PRG), depending
on the author. Simple PRNGs are deterministic. That is, they have a small amount
of internal state initialized by a seed value. Once the internal state is known, the
sequence of pseudorandom numbers (PNs) can be determined. For example, the common
Linear Congruential Generator (LCG) algorithm produces random-appearing values
that are entirely predictable if the input parameters are known or guessed. Consequently, LCGs
are perfectly fine for use in certain programs (e.g., games that simulate random
events) but insufficient for cryptographic purposes.
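
The predictability of an LCG is easy to demonstrate. In this sketch the multiplier, increment, and modulus are assumed values (the same constants appear in some C library rand() implementations); any fixed choice makes the same point.

```python
# Illustrative LCG parameters (assumed; any fixed A, C, M behaves the same way).
A, C, M = 1103515245, 12345, 2**31

def lcg(seed):
    """Yield an endless stream of LCG outputs starting from the given seed."""
    state = seed
    while True:
        state = (A * state + C) % M
        yield state

def predict_next(observed):
    """An observer who knows A, C, and M needs only one output to get the next."""
    return (A * observed + C) % M

gen = lcg(seed=2024)
first = next(gen)
second = next(gen)

print(predict_next(first) == second)   # True
```

Because each output is the generator's entire internal state, seeing a single value is enough to reproduce everything that follows, which is exactly the property a key generator must not have.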

A pseudorandom function family (PRF) is a family of functions that appear to
be algorithmically indistinguishable (by polynomial time algorithms) from truly 
random functions [GGM86]. A PRF is a stronger concept than a PRG, as a PRG can 
be created from a PRF. PRFs are the basis for cryptographically strong (or secure) 
pseudorandom number generators, called CSPRNGs. CSPRNGs are necessary in 
cryptographic applications for several purposes, including session key generation, 
for which a sufficient amount of randomness must be guaranteed [RFC4086]. 
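
In practice, programs usually obtain such values from a CSPRNG supplied by the operating system or language runtime. The sketch below uses Python's standard secrets module as one example of such an interface; the key and nonce sizes are assumptions chosen for illustration.

```python
import secrets

# 16 bytes = 128 bits, e.g., the size of an AES-128 session key.
session_key = secrets.token_bytes(16)
another_key = secrets.token_bytes(16)

# The same source can produce hex strings, e.g., for use as nonces.
nonce_hex = secrets.token_hex(8)       # 8 random bytes as 16 hex characters

print(len(session_key), len(nonce_hex))   # 16 16
print(session_key != another_key)         # True (a match is astronomically unlikely)
```

Unlike the LCG case, observing one output gives no practical help in predicting the next.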

18.4.7 Nonces and Salt 

A cryptographic nonce is a number that is used once (or for one transaction) in a 
cryptographic protocol. Most commonly, a nonce is a random or pseudorandom 
number that is used in authentication protocols to ensure freshness. Freshness is 
the (desirable) property that a message or operation has taken place in the very 
recent past. For example, in a challenge-response protocol, a server may provide a 
requesting client with a nonce, and the client may need to respond with
authentication material as well as a copy of the nonce (or perhaps an encrypted copy of the
nonce) within a certain period of time. This helps to avoid replay attacks, because 
old authentication exchanges that are replayed to the server would not contain the 
correct nonce value. 
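
A minimal challenge-response exchange along these lines might look like the following sketch. The shared secret, the Server class, and the use of HMAC-SHA256 as the "authentication material" are assumptions for illustration, and the time limit on responses mentioned above is omitted for brevity.

```python
import hashlib
import hmac
import secrets

SHARED_SECRET = b"hypothetical pre-shared secret"   # assumed to be known to both ends

def client_response(nonce: bytes) -> bytes:
    """Client proves knowledge of the secret by keying an HMAC over the nonce."""
    return hmac.new(SHARED_SECRET, nonce, hashlib.sha256).digest()

class Server:
    def __init__(self):
        self.outstanding = set()        # nonces issued but not yet used

    def challenge(self) -> bytes:
        nonce = secrets.token_bytes(16)
        self.outstanding.add(nonce)
        return nonce

    def verify(self, nonce: bytes, response: bytes) -> bool:
        if nonce not in self.outstanding:
            return False                # unknown or already-used nonce
        self.outstanding.discard(nonce) # each nonce is accepted only once
        expected = hmac.new(SHARED_SECRET, nonce, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)

server = Server()
nonce = server.challenge()
response = client_response(nonce)

print(server.verify(nonce, response))   # True: fresh exchange
print(server.verify(nonce, response))   # False: replay of the same exchange
```

A captured response is useless against a later challenge, because the server will have issued a different nonce and the old HMAC value will not match it.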

A salt or salt value, used in the cryptographic context, is a random or
pseudorandom number used to frustrate brute-force attacks on secrets. Brute-force attacks
usually involve repeatedly guessing a password, passphrase, key, or equivalent
secret value and checking to see if the guess was correct. Salts work by
frustrating the checking portion of a brute-force attack. The best-known example is the
way passwords used to be handled in the UNIX system. Users' passwords were 
encrypted and stored in a password file that all users could read. When logging 
in, each user would provide a password that was used to double encrypt a fixed 
value. The result was then compared against the user's entry in the password file. 
A match indicated that a correct password was provided. 

At the time, the encryption method (DES) was well known and there was
concern that a hardware-based dictionary attack would be possible whereby many
words from a dictionary were encrypted with DES ahead of time (forming a
rainbow table) and compared against the password file. A pseudorandom 12-bit salt
was added to perturb the DES algorithm in one of 4096 (nonstandard) ways for
each password in an effort to thwart this attack. Ultimately, the 12-bit salt was
determined to be insufficient with improved computers (that could guess more
values) and was expanded.
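
The effect of a salt can be sketched with a modern stand-in for the scheme described above. This example does not reproduce the historical DES-based method; it substitutes PBKDF2 with HMAC-SHA256 from Python's standard library, and the salt length, iteration count, and function names are assumptions for illustration.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple

ITERATIONS = 100_000   # illustrative work factor (an assumption)

def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random 16-byte salt is used if none is given."""
    if salt is None:
        salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def check_password(password: str, salt: bytes, stored: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored)

salt1, stored1 = hash_password("correct horse")
salt2, stored2 = hash_password("correct horse")

print(stored1 != stored2)                               # True: same password, different salts
print(check_password("correct horse", salt1, stored1))  # True
print(check_password("wrong guess", salt1, stored1))    # False
```

Because the same password produces a different stored value under each salt, a table of digests computed ahead of time would have to be rebuilt for every possible salt value, which is the cost the salt is meant to impose.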
18.4.8 Cryptographic Hash Functions and Message Digests

In most of the protocols we have studied, including Ethernet, IP, ICMP, UDP, and
TCP, we have seen the use of a frame check sequence (FCS, either a checksum or
a CRC) to determine whether a PDU has likely been delivered without bit errors.
Such mathematical functions tend to trade off the likelihood of detecting random
errors against the amount of overhead required to carry the FCS value. When
considering security, however, we are interested in ensuring message integrity
not only against random, infrequent errors, but also against intentional message
stream modification attacks. We are worried about Mallory modifying messages
as they travel through the network. Ordinary FCS functions are not sufficient for
this purpose.
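
To see why an ordinary FCS does not stop an active attacker, consider the following sketch. It uses CRC-32 from Python's zlib module as a representative unkeyed FCS; the framing format (payload followed by a 4-byte CRC) is an assumption made for the example.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a CRC-32 to the payload, as an unkeyed FCS would be."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def fcs_ok(data: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the trailer."""
    payload, fcs = data[:-4], data[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == fcs

sent = frame(b"pay Bob $10")

# A random single-bit error is detected.
corrupted = bytes([sent[0] ^ 0x01]) + sent[1:]

# Mallory replaces the payload and simply recomputes the CRC; no secret is needed.
forged = frame(b"pay Mallory $99")

print(fcs_ok(sent), fcs_ok(corrupted), fcs_ok(forged))
```

The forged frame passes the check because anyone can compute the CRC. Detecting this kind of modification requires a function that an adversary cannot recompute, which motivates the constructions that follow.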

A checksum or FCS can be used to verify message integrity against an adversary
like Mallory if properly constructed using special functions. Such functions
are called cryptographic hash functions and often resemble portions of encryption
algorithms. The output of a cryptographic hash function H, when provided a message
M, is called the digest or fingerprint of the message, H(M). A message digest
is a type of strong FCS that is easy to compute and has the following important
properties:

• Preimage resistance: Given H(M), it should be difficult to determine M if
not already known.

• Second preimage resistance: Given H(M1), it should be difficult to determine
an M2 ≠ M1 such that H(M1) = H(M2).

• Collision resistance: It should be difficult to find any pair M1, M2 where
H(M1) = H(M2) when M2 ≠ M1.
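
The following sketch suggests why digest length matters for collision resistance. It defines a deliberately weak 16-bit "digest" (the first two bytes of SHA-256, a construction invented here purely for illustration) and finds a collision by trying consecutive messages.

```python
import hashlib

def tiny_hash(message: bytes) -> bytes:
    """A deliberately weak 16-bit 'digest': the first 2 bytes of SHA-256."""
    return hashlib.sha256(message).digest()[:2]

def find_collision() -> tuple[bytes, bytes]:
    """Try messages b"0", b"1", ... until two share the same tiny_hash."""
    seen: dict[bytes, bytes] = {}
    i = 0
    while True:
        m = str(i).encode()
        h = tiny_hash(m)
        if h in seen:
            return seen[h], m
        seen[h] = m
        i += 1

m1, m2 = find_collision()
print(m1, m2, tiny_hash(m1).hex())
```

With only 65,536 possible outputs the search must succeed within 65,537 attempts, and it typically finishes after a few hundred. Finding a second preimage for one specific message would be expected to take far more attempts, which is why collision resistance is the hardest of the three properties to provide.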

If a hash function has all of these properties, then if two messages have the
same cryptographic hash value, they are, with negligible doubt, the same message.
The two most common cryptographic hash algorithms are at present the
Message Digest Algorithm 5 (MD5, [RFC1321]), which produces a 128-bit (16-byte)
digest, and the Secure Hash Algorithm 1 (SHA-1), which produces a 160-bit (20-byte)
digest. More recently, a family of functions based on SHA called SHA-2 [RFC6234]
produce digests with lengths of 224, 256, 384, or 512 bits (28, 32, 48, and 64 bytes,
respectively). Others are under development.
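
The digest lengths listed above can be checked with Python's hashlib module. The message is arbitrary, and the usedforsecurity=False argument (available in Python 3.9 and later) is included only so that MD5 and SHA-1 remain usable on systems that restrict them.

```python
import hashlib

ALGORITHMS = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")
message = b"hello, world"

sizes = {}
for name in ALGORITHMS:
    h = hashlib.new(name, message, usedforsecurity=False)
    sizes[name] = h.digest_size          # digest length in bytes
    print(f"{name:7s} {h.digest_size * 8:4d} bits  {h.hexdigest()[:16]}...")
```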


Notes 

Cryptographic hash functions are often based on a compression function f, which
takes an input of length L and produces a collision-resistant but deterministic
output of size less than L. The Merkle-Damgård construction, which essentially
breaks an arbitrarily long input into blocks of length L, pads them, passes them to
f, and combines the results, produces a cryptographic hash function capable of
taking a long input and producing an output with collision resistance.

MD5 had been in widespread use with Internet protocols until it was reported
broken in 2005 (i.e., two different 128-byte sequences were shown to have the
same MD5 value) [WY05]. SHA-1 was used as an alternative, but it was also
thought to possibly have weaknesses, so a SHA-2 family of algorithms was developed.
Given SHA-2's similarity to SHA-1, there is concern that it, too, may have
weaknesses. In December 2010, the National Institute of Standards and Technology
(NIST) in the United States announced that five algorithms had been selected
as final candidates for a new "SHA-3" cryptographic hash algorithm [CHP]. The
selection of the final winning algorithm is scheduled for sometime after spring
2012.


18.4.9 Message Authentication Codes (MACs, HMAC, CMAC, and GMAC) 

A message authentication code (unfortunately abbreviated MAC or sometimes
MIC but unrelated to the link-layer MAC addresses we discussed in Chapter 3)
can be used to ensure message integrity and authentication. MACs are usually
based on keyed cryptographic hash functions. Such functions are like message
digest algorithms (see Section 18.4.8) but require a secret key to produce or verify
the integrity of a message and may also be used to verify (authenticate) the message's
sender.

MACs require resistance to various forms of forgery. For a given keyed hash 
function H(M,K) taking input message M and key K, resistance to selective forgery 
means that it is difficult for an adversary not knowing K to form H(M,K) given a 
specific M. H(M,K) is resistant to existential forgery if it is difficult for an adversary 
lacking K to find any previously unknown valid combination of M and H(M,K). 
Note that MACs do not provide exactly the same features as digital signatures. For 
example, they cannot be a solid basis for nonrepudiation because the secret key is 
known to more than one party. 

A standard MAC that uses cryptographic hash functions in a particular way 
is called the keyed-hash message authentication code (HMAC) [FIPS198][RFC2104]. 
The HMAC "algorithm" uses a generic cryptographic hash algorithm, say H(M). 
To form a t-byte HMAC on message M with key K using H (called HMAC-H), we 
use the following definition: 

$$\text{HMAC-}H(K, M)^t = \Lambda_t\big(H\big((K \oplus \mathit{opad}) \,\|\, H((K \oplus \mathit{ipad}) \,\|\, M)\big)\big)$$

In this definition, opad (outer pad) is an array containing the value 0x5C
repeated |K| times, and ipad (inner pad) is an array containing the value 0x36
repeated |K| times. ⊕ is the vector XOR operator, and || is the concatenation operator.
Normally the HMAC output is intended to be a certain number t of bytes in
length, so the operator Λ_t(M) takes the left-most t bytes of M.
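
The definition can be followed step by step in Python and compared against the standard library's hmac module. This is a sketch rather than a replacement for a vetted implementation. It assumes SHA-256 as the hash H and follows the key preparation in [RFC2104] (keys longer than the hash's block size are hashed, then the key is zero-padded to the block size), a detail the definition above leaves implicit. The function name hmac_h is hypothetical.

```python
import hashlib
import hmac

def hmac_h(key: bytes, message: bytes, t: int, hash_name: str = "sha256") -> bytes:
    """Compute a t-byte HMAC directly from the definition."""
    block_size = hashlib.new(hash_name).block_size   # 64 bytes for SHA-256
    if len(key) > block_size:                        # over-long keys are hashed first
        key = hashlib.new(hash_name, key).digest()
    key = key.ljust(block_size, b"\x00")             # then zero-padded to the block size
    k_ipad = bytes(b ^ 0x36 for b in key)            # K XOR ipad
    k_opad = bytes(b ^ 0x5C for b in key)            # K XOR opad
    inner = hashlib.new(hash_name, k_ipad + message).digest()
    outer = hashlib.new(hash_name, k_opad + inner).digest()
    return outer[:t]                                 # left-most t bytes

key, msg = b"shared secret", b"attack at dawn"
mine = hmac_h(key, msg, 16)
reference = hmac.new(key, msg, hashlib.sha256).digest()[:16]
print(hmac.compare_digest(mine, reference))
```

In practice the library routine should be used directly. The comparison uses hmac.compare_digest so that checking a received MAC does not leak timing information.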

The careful reader will observe that the definition of HMAC is a hash around
another hash, of the form H(K1 || H(K2 || M)) using keys K1 and K2. This structure
resists so-called extension attacks in which a selected pad value can be combined
(e.g., by Mallory) with an intercepted message and digest value to form a new,
valid message and digest value (not sent by Alice). The values of ipad and opad are
not critical but tend to produce K1 and K2 values with few bits in common (i.e.,
they have a large Hamming distance). Certain extension attacks have been shown to
be effective against naively constructed MACs such as those of the form H(K || M)
or H(M || K) but ineffective against the HMAC construct (or NMAC construct
[BCK96], of which HMAC is a derivative) [B06].

More recently, other forms of MACs have been standardized, called the cipher-based
MAC (CMAC) [FIPS800-38B] and GMAC [NIST800-38D]. Instead of using a
cryptographic hash function as HMAC does, these use a block cipher such as AES
or 3DES. CMAC is envisioned for use in environments where it is more convenient
or efficient to use a block cipher in place of a hash function. Details of CMAC
using AES-128, called AES-CMAC, are provided in [RFC4493]. In essence, it works
by encrypting a message block using AES-128 with a key K, taking the result and
XORing it with the subsequent block, encrypting the result, and repeating the process
until no more message blocks remain, with the output value being the result
of the final encryption operation. If the final message block's length is an even
multiple of the algorithm's block size, one subkey, derived from K using a special
subkey-generating algorithm [IK03], is used in performing the final encryption.
If not, the final message block is first padded and a second subkey, also generated
from K, is used to perform the final encryption. GMAC uses a special mode of
AES called Galois/Counter Mode (GCM). It also uses a keyed hash function (called
GHASH, which is not a cryptographic hash function). We will see more about
cryptographic operating modes in the next section.

18.4.10 Cryptographic Suites and Cipher Suites 

At this point we have seen mechanisms to ensure confidentiality, authenticity, and
integrity of information sent across an unsecured communication network. There
are other capabilities (e.g., nonrepudiation) that can also be achieved by selecting
the appropriate mathematical or cryptographic techniques. The combination of
techniques used in a particular system, especially those we see used with Internet
protocols, is called a cryptographic suite or sometimes a cipher suite, although the
first term is more accurate. A cryptographic suite defines not only an enciphering
(encryption) algorithm but may also include a particular MAC algorithm, PRF,
key agreement algorithm, signature algorithm, and associated key lengths and
parameters.

Many cryptographic suites are defined for use with the security protocols
we shall discuss. Usually, an encryption algorithm is specified by its name and
description, how many bits are used for its keys (often a multiple of 128 bits), along
with its operating mode. Encryption algorithms that have been standardized for
use with Internet protocols include AES, 3DES, NULL [RFC2410], and CAMELLIA
[RFC3713]. The NULL encryption algorithm does not modify the input and is
used in certain circumstances where confidentiality is not required.

The operating mode of an encryption algorithm, especially a block cipher,
describes how to use the encryption function for a single block repeatedly (e.g.,
in a cascade) to encrypt or decrypt an entire message with a single key. Common
modes today include cipher block chaining (CBC) and counter (CTR) mode, although
many others have been defined. When performing encryption using CBC mode,
a cleartext block to be encrypted is first XORed with the previous ciphertext block
(the first block is XORed with a random initialization vector or IV). Encrypting in
CTR mode involves first creating a value combining a nonce (or IV) and a counter
that increments with each successive block to be encrypted. The combination is
then encrypted, the output is XORed with a cleartext block to produce a ciphertext
block, and the process repeats for successive blocks. In effect, this approach uses a
block cipher to produce a keystream. A keystream is a sequence of (random-appearing)
bits that are combined (e.g., XORed) with cleartext bits to produce a ciphertext.
Doing so essentially converts a block cipher into a stream cipher because no
explicit padding of the input is required.
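
The structure of CTR mode can be sketched with the Python standard library alone. Because the standard library has no AES, the sketch substitutes a truncated HMAC-SHA256 for the block cipher's encryption function. That substitution is an assumption made purely to keep the example self-contained (CTR mode only ever uses the cipher in the forward direction); it is not AES and should not be used to protect real data. The names block_fn and ctr_xor are hypothetical.

```python
import hashlib
import hmac

BLOCK = 16  # bytes per keystream block, matching the AES block size

def block_fn(key: bytes, block: bytes) -> bytes:
    """Stand-in for block cipher encryption E_K(block); NOT AES."""
    return hmac.new(key, block, hashlib.sha256).digest()[:BLOCK]

def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XORing data with E_K(nonce || counter) blocks."""
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        # Combine the nonce with a counter that increments for each block.
        counter_block = nonce + (i // BLOCK).to_bytes(BLOCK - len(nonce), "big")
        keystream = block_fn(key, counter_block)
        chunk = data[i:i + BLOCK]            # the last chunk may be short
        out.extend(c ^ k for c, k in zip(chunk, keystream))
    return bytes(out)

key, nonce = b"k" * 16, b"n" * 8
cleartext = b"no padding needed here!"
ciphertext = ctr_xor(key, nonce, cleartext)
print(len(cleartext), len(ciphertext), ctr_xor(key, nonce, ciphertext))
```

The same function encrypts and decrypts, the ciphertext has exactly the length of the cleartext, and each keystream block depends only on the key, nonce, and counter, so blocks can be processed in any order. A nonce must never be reused with the same key, since two messages would then share a keystream.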

CBC requires a serial process for encryption and a partly serial process for
decryption, whereas counter mode algorithms allow more efficient fully parallel
encryption and decryption implementations. Consequently, counter mode is
gaining popularity. In addition, variants of CTR mode (e.g., counter mode with
CBC-MAC (CCM), Galois Counter Mode, or GCM) can be used for authenticated
encryption [RFC4309], and possibly to authenticate (but not encrypt) additional
data (called authenticated encryption with associated data or AEAD) [RFC5116]. When
authenticated encryption algorithms are used, separate MACs are generally not
necessary. In the degenerate case of an AEAD algorithm operating on data that
does not require confidentiality, a form of MAC is effectively produced (e.g.,
GMAC). When an encryption algorithm is specified as part of a cryptographic
suite, its name usually includes the mode, and the key length is often implied. For
example, ENCR_AES_CTR refers to AES-128 used in CTR mode.

When a PRF is included in the definition of a cryptographic suite, it is usually
based on a cryptographic hash algorithm family such as SHA-2 [RFC6234] or
a cryptographic MAC such as CMAC [RFC4434][RFC4615]. Constructions of this
type generally include the name of the function serving as the basis. For example,
the algorithm AES-CMAC-PRF-128 refers to a PRF constructed using a CMAC
based on AES-128. It is also written as PRF_AES128_CMAC. The algorithm
PRF_HMAC_SHA1 refers to a PRF based on HMAC-SHA1.

Key agreement parameters, when included with an Internet cryptographic
suite definition, refer to DH group definitions, as no other key agreement protocol
is in widespread use. When DH key agreement is used in generating keys
for a particular encryption algorithm, care must be taken to ensure that the keys
produced are of sufficient length (strength) to avoid compromising the security
of the encryption algorithm. Consequently, more than 16 groups for use with DH
in different contexts have been standardized [RFC5114]. The first 5 have become
known as the "Oakley Groups" because they were specified by the Oakley protocol
[RFC2409], an early component of IPsec that has since been deprecated. The
modular exponential or MODP groups are based on exponentiation and modular
arithmetic. The elliptic curve groups modulo a prime or ECP groups [RFC5903] are
based on curves over the Galois field GF(P) for a prime P, and the elliptic curve
groups modulo a power of two or EC2N are based on curves over the field GF(2^N) for
some N.
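
As a reminder of what a MODP group supplies, here is a toy DH exchange. The prime and base are chosen only to keep the example short (2^127 - 1 is a Mersenne prime, and 3 is an arbitrary base); they are assumptions for illustration and are far too small for real use, where one of the standardized groups would be selected instead.

```python
import secrets

# Toy MODP parameters, for illustration only.
P = 2**127 - 1   # a (much too small) prime modulus
G = 3            # an arbitrary base

def keypair() -> tuple[int, int]:
    """Return (private exponent, public value G^private mod P)."""
    private = secrets.randbelow(P - 3) + 2      # in the range [2, P-2]
    public = pow(G, private, P)
    return private, public

a_priv, a_pub = keypair()    # Alice
b_priv, b_pub = keypair()    # Bob

# Each side combines its own private value with the other's public value.
shared_a = pow(b_pub, a_priv, P)
shared_b = pow(a_pub, b_priv, P)
print(shared_a == shared_b)
```

The group definition fixes P and G so that both parties, and any implementation of the cryptographic suite, compute in the same structure. The size of P bounds the strength of the resulting key material, which is why larger groups accompany stronger encryption algorithms.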

A signature algorithm is sometimes included in the definition of a cryptographic
suite. It may be used for signing a variety of values including data, MACs,
and DH values. The most common is to use RSA to sign a hashed value for some
block of data, although the digital signature standard (written as DSS or DSA to indicate
the digital signature algorithm) [FIPS186-3] is also used in some circumstances.
With the advent of ECC, signatures based on elliptic curves (e.g., ECDSA
[X9.62-2005]) are also now supported in many systems.

The concept of a cryptographic suite evolved in the context of Internet security
protocols because of a need for modularity and decoupled evolution. As computational
power has improved, older cryptographic algorithms and smaller key
lengths have fallen victim to various forms of brute-force attacks. In some cases,
more sophisticated attacks have revealed flaws that necessitate the replacement of
the underlying mathematical and cryptographic methods, but the basic protocol
machinery is otherwise sound. As a result, the choice of a cryptographic suite can
now be made separately from the communication protocol details and depends
on factors such as convenience, performance, and security. Protocols tend to make
use of the components of a cryptographic suite in a standard way, so an appropriate
cryptographic suite can be "snapped in" when deemed appropriate. It is
now common practice in protocol design to "outsource" the security processing
to a separately defined set of cryptographic suites that have been analyzed by a
large community with the necessary cryptographic and mathematical expertise.
Although the ability to "snap in" a new cipher suite is appealing, it can still take
years to standardize on acceptable suites and get them deployed. For interoperability,
each participant in a communication exchange must usually employ the
same suite. This can be a significant hurdle when cipher suites may be implemented
in a wide range of software and hardware systems.


18.5 Certificates, Certificate Authorities (CAs), and PKIs 

The tools provided by cryptography and related mathematics, including digital
signatures and enciphering algorithms, provide a sound basis for constructing
secure systems, but a great deal of additional work is required to create an entire
system from these parts. Among the items of particular concern are the construction
of secure protocols that use cryptographic methods in safe ways, and how
keys are created, exchanged, and revoked (called key management). Key management
remains one of the greatest challenges in deploying cryptographic systems
on a widespread basis across multiple administrative domains.

One of the challenges with public key cryptosystems is to determine the correct
public key for a principal or identity. In our running example, if Alice were to
send her public key to Bob, Mallory could modify it in transit to be her own public
key, and Bob (called the relying party here) might unknowingly be using Mallory's
key, thinking it is Alice's. This would allow Mallory to effectively masquerade as
Alice. To address this problem, a public key certificate is used to bind an identity to
a particular public key using a digital signature. At first glance, this presents a
certain "chicken-egg" problem: How can a public key become signed if the digital
signature itself requires a reliable public key? There are two ways this is accomplished
today.

One model, called a web of trust, involves having a certificate (identity/key 
binding) endorsed by a collection of existing users (called endorsers). An endorser 
signs a certificate and distributes the signed certificate. The more endorsers for 
a certificate over time, the more reliable it is likely to be. An entity checking a 
certificate might require some number of endorsers or possibly some particular 
endorsers to trust the certificate. The web of trust model is decentralized and 
"grassroots" in nature, with no central authority. This has mixed consequences. 
Having no central authority suggests that the scheme will not collapse because 
of a single point of failure, but it also means that a new entrant may experience 
some delay in getting its key endorsed to a degree sufficient to be trusted by a 
significant number of users. Some groups hold "key signing parties" to hasten 
this process. The web of trust model was first described as part of the Pretty Good 
Privacy (PGP) encryption system for electronic mail [NAZ00], which has evolved
to support a standard encoding format called OpenPGP, defined by [RFC4880]. 

A more formal approach, which has the added benefit of being provably
secure under certain theoretical assumptions in exchange for more dependence
on a centralized authority, involves the use of a public key infrastructure (PKI). A
PKI is a service responsible for creating, revoking, distributing, and updating key
pairs and certificates. It operates with a collection of certificate authorities (CAs). A
CA is an entity and service set up to manage and attest to the bindings between
identities and their corresponding public keys. There are several hundred commercial
CAs. A CA usually employs a hierarchical signing scheme. This means that
a public key may be signed using a parent key which is in turn signed by a grandparent
key, and so on. Ultimately a CA has one or more root certificates upon which
many subordinate certificates depend for trust. An entity that is authoritative for
certificates and keys (e.g., a CA) is called a trust anchor, although this term is also
used to describe the certificates or other cryptographic material associated with
such entities [RFC6024], which we discuss next.

18.5.1 Public Key Certificates, Certificate Authorities, and X.509 

While several types of certificates have been used in the past, the one of most interest
to us is based on an Internet profile of the ITU-T X.509 standard [RFC5280]. In
addition, any particular certificate may be stored and exchanged in a number of
file or encoding formats. The most common ones include DER, PEM (a Base64
encoded version of DER), PKCS#7 (P7B), and PKCS#12 (PFX). We also saw the use
of PKCS#1 [RFC3447] in Chapter 8. Today, Internet PKI-related standards tend to
use the cryptographic message syntax [RFC5652], which is based on PKCS#7 version
1.5. In the following example, we use an X.509 certificate in PEM format, which is
the default format for many Internet applications and has the added advantage of
being easily displayed as ASCII.
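
The relationship between DER and PEM is simple enough to sketch directly: PEM is the Base64 encoding of the DER bytes, wrapped at 64 characters and placed between BEGIN and END lines. The following Python functions are illustrative (their names are hypothetical), and the input bytes are a placeholder rather than a real certificate.

```python
import base64

HEADER = "-----BEGIN CERTIFICATE-----"
FOOTER = "-----END CERTIFICATE-----"

def der_to_pem(der: bytes) -> str:
    """Base64-encode DER bytes and wrap them in PEM certificate markers."""
    b64 = base64.b64encode(der).decode("ascii")
    lines = [b64[i:i + 64] for i in range(0, len(b64), 64)]
    return "\n".join([HEADER, *lines, FOOTER]) + "\n"

def pem_to_der(pem: str) -> bytes:
    """Strip the markers and whitespace, then Base64-decode the body."""
    body = pem.split(HEADER, 1)[1].split(FOOTER, 1)[0]
    return base64.b64decode("".join(body.split()))

fake_der = bytes(range(100))   # placeholder bytes, not a real certificate
pem = der_to_pem(fake_der)
print(pem)
```

Python's ssl module offers equivalent helpers (ssl.DER_cert_to_PEM_cert and ssl.PEM_cert_to_DER_cert) for real certificates.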

Certificates are primarily used in identifying four types of entities on the
Internet: individuals, servers, software publishers, and CAs. One popular commercial
CA, Verisign, assigns a "class" to each certificate, in the range 1 through
5. Class 1 certificates are intended for individuals, class 2 for organizations, class
3 for servers and software signing, class 4 for online transactions between companies,
and class 5 for private organizations and governments. Certificate classes
are primarily a convenience for grouping and naming types of certificates and for
defining different security policies associated with them. Generally speaking, a
higher class number is supposed to indicate more rigorous controls on the process
required to validate an identity (called identity proofing) prior to issuing the associated
certificate.

This still does not totally solve the chicken-egg PKI bootstrapping problem
mentioned before. In practice, systems requiring public key operations have root
certificates for popular CAs installed at configuration time (e.g., Microsoft Internet
Explorer, Mozilla's Firefox, and Google's Chrome are all capable of accessing
a preconfigured database of root certificates). To see how this works, we can use
a command that gives information about certificates. The openssl command,
available for most common platforms including Linux and Windows, allows us to
see the certificates for a Web site (some lines are wrapped for clarity):

Linux% CDIR=`openssl version -d | awk '{print $2}'`

Linux% openssl s_client -CApath $CDIR \
       -connect www.digicert.com:443 > digicert.out 2>&1

^C (to interrupt)


The first command determines where the local system stores its preconfigured
CA certificates. This is usually a directory that varies by system. In this case,
the name of the directory is stored in the shell variable CDIR. We next make a connection
to the HTTPS port (443) on the www.digicert.com server and redirect
the output to the digicert.out file. The openssl command² takes care to print
the entity identified by each of the certificates, and at what depth they are in the
certificate hierarchy relative to the root (depth 0 is the server's certificate, so the
depth numbers are counted bottom to top). It also checks the certificates against
the stored CA certificates to see if they verify properly. In this case, they do, as
indicated by "verify return" having value 0 (ok).


2. Note that a similar command unique to Windows called certutil is available with Windows
2003 Server and the Windows Server 2003 Administration Tools Pack.

Linux% grep "return code" digicert.out 

Verify return code: 0 (ok) 


The file digicert.out contains not only a trace of the connection to the 
server but also a copy of the server's certificate. To get the certificate into a more 
usable form, we can extract the certificate data, convert it, and place the result into 
a PEM-encoded certificate file: 


Linux% openssl x509 -in digicert.out -out digicert.pem 


Given the certificate in PEM format, we can now use a variety of openssl
functions to manipulate and inspect it. At the highest level, the certificate includes
some data to be signed (called the "TBSCertificate") followed by a signature algorithm
identifier and signature value. To see the server certificate, we can use the
following command (some lines are wrapped or removed for clarity):


Linux% openssl x509 -in digicert.pem -text 

Certificate: 

Data: 

Version: 3 (0x2) 

Serial Number: 

02:c7:1f:e0:1d:70:41:4b:8b:a7:e2:9e:5e:58:42:b9 
Signature Algorithm: sha1WithRSAEncryption 
Issuer: C=US, O=DigiCert Inc, OU=www.digicert.com, 
CN=DigiCert High Assurance EV CA-1 

Validity 

Not Before: Oct 6 00:00:00 2010 GMT 
Not After : Oct 9 23:59:59 2012 GMT 
Subject: 2.5.4.15=V1.0, Clause 5.(b)/ 
1.3.6.1.4.1.311.60.2.1.3=us/ 
1.3.6.1.4.1.311.60.2.1.2=Utah/ 
serialNumber=5299537-0142, 

C=US, ST=Utah, L=Lindon, O=DigiCert, Inc., 

CN=www.digicert.com 
Subject Public Key Info: 

Public Key Algorithm: rsaEncryption 
RSA Public Key: (2048 bit) 

Modulus (2048 bit): 

00:d1:76:0b:1e:4e:96:d2:08:c1:b8:75:bd:20:9c: 
66:7f:42:6b:54:8b:7f:7a:4a:f8:3e:df:70:68:1f: 

25:7b:40:e9:e3:cc:a2:0d:95:29:f4:08:ed:50:16: 
52:11:6f:de:a0:bb:34:bc:8b:b5:60:c1:ab:e4:78: 
75:9f 

Exponent: 65537 (0x10001) 

X509v3 extensions: 

X509v3 Authority Key Identifier: 

keyid:4C:58:CB:25:F0:41:4F:52:F4: 

28:C8:81:43:9B:A6:A8:A0:E6:92:E5 
X509v3 Subject Key Identifier: 

4F:E0:97:FF:C1:AE:06:53:03:19:F7: 

0A:37:4B:9F:F0:13:E2:88:D8 
X509v3 Subject Alternative Name: 

DNS:www.digicert.com, DNS:content.digicert.com 
Authority Information Access: 

OCSP - URI:http://ocsp.digicert.com 
CA Issuers - URI: 

http://www.digicert.com/CACerts/ 
DigiCertHighAssuranceEVCA-1.crt 
Netscape Cert Type: 

SSL Client, SSL Server 
X509v3 Key Usage: critical 

Digital Signature, Key Encipherment 
X509v3 Basic Constraints: critical 
CA:FALSE 

X509v3 CRL Distribution Points: 

URI:http://crl3.digicert.com/ev2009a.crl 
URI:http://crl4.digicert.com/ev2009a.crl 
X509v3 Certificate Policies: 

Policy: 2.16.840.1.114412.2.1 

CPS: http://www.digicert.com/ssl-cps-repository.htm 
User Notice: 

Explicit Text: 

X509v3 Extended Key Usage: 

TLS Web Server Authentication, 

TLS Web Client Authentication 
Signature Algorithm: sha1WithRSAEncryption 

e1:e6:dd:0e:23:5f:08:9a:63:63:c7:a1:f3:95:f0:ca:7e:3c: 

57:81:2c:2a:19:2b:24:fe:e4:26:bd:91:27:7c:11:50:35:e7: 

fd:64:6f:97:8b:15:fb:d1:7a:f7:67:80:da:da:41:d8:e3:f9: 
e4:bd:92:97 

-----BEGIN CERTIFICATE----- 

MIIHLTCCBhWgAwIBAgIQAscf4B1wQUuLp+KeXlhCuTANBgkqhkiG9w0BAQUFADBp 

MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMRkwFwYDVQQLExB3 

8+qQ0wF/xY9rHM0+eIgy3da4AFhfW4sAmyafs7hcEMjUAkS6Yb0qIw8ud/1kb5eL 
FfvRevdngNraQdjj+eS9kpc= 

-----END CERTIFICATE----- 

Looking at the command's output, we see a decoded version of the certificate
followed by an ASCII (PEM) representation of the certificate (between the BEGIN
CERTIFICATE and END CERTIFICATE indicators). The decoded certificate shows
a data portion and a signature portion. Within the data portion is some metadata
including a Version field, indicating the particular X.509 certificate type (3, the
most recent, is encoded using hex value 0x02), a Serial Number of the particular certificate,
a number assigned by the CA unique to each certificate, and a Validity field
that gives the time during which the certificate should be treated as legitimate,
starting with the Not Before subfield and ending with the Not After subfield. The
certificate metadata also indicates which signature algorithm is used to sign the
data portion. In this case, it is signed by computing a hash using SHA-1 and signing
the result using RSA. The signature itself appears at the end of the certificate.
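
A relying party's check of the Validity field amounts to a comparison of times. The sketch below uses the two timestamps from the example certificate and Python's ssl.cert_time_to_seconds, which parses the date format printed by openssl. The function name within_validity and the two test instants are assumptions chosen for illustration.

```python
import calendar
import ssl

NOT_BEFORE = "Oct 6 00:00:00 2010 GMT"
NOT_AFTER = "Oct 9 23:59:59 2012 GMT"

def within_validity(now: float, not_before: str, not_after: str) -> bool:
    """True if the POSIX time 'now' lies inside the certificate's validity period."""
    start = ssl.cert_time_to_seconds(not_before)
    end = ssl.cert_time_to_seconds(not_after)
    return start <= now <= end

mid_2011 = calendar.timegm((2011, 6, 1, 0, 0, 0))     # inside the period
early_2013 = calendar.timegm((2013, 1, 1, 0, 0, 0))   # after Not After

print(within_validity(mid_2011, NOT_BEFORE, NOT_AFTER))
print(within_validity(early_2013, NOT_BEFORE, NOT_AFTER))
```

A certificate outside its validity period is rejected regardless of whether its signature verifies.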

The Issuer field indicates the distinguished name (jargon from the ITU-T X.500
standard) of the entity that issued the certificate and may have these special subfields
(based on X.501): C (country), L (locale or city), O (organization), OU (organizational
unit), ST (state or province), CN (common name). Other subfields have
also been defined. In this case, we can see that an extended validation (EV) [CABF09]
CA certificate has been used to sign the server's certificate.

EV certificates represent an industry response to certain phishing attacks
involving malicious Web sites that were issued certificates without rigorous identity
proofing. Issuing of an EV certificate takes place only under an agreed-upon
set of stringent criteria, and a user visiting a Web site using EV certificates and a
modern browser typically sees a green title bar and CA information to indicate the
enhanced level of rigor. One of the requirements for EV certificates placed upon
each CA is to provide a certification practice statement (CPS), which outlines the
practices used in issuing certificates. Considerations for authors of CPSs (and certificate
policies or CPs that apply on a per-certificate basis) are given in [RFC5280].
Note that although EV certificates may provide higher assurance (e.g., for some
Web sites), most users do not pay careful attention to the cues provided by Web
browsers that reveal this fact [BOPSW09].

The Subject field identifies the entity this certificate is about, and the owner
of the public key contained in the subsequent Subject Public Key Info field. In this
example, the Subject field is a somewhat complex structure like the Issuer field
and contains multiple object IDs (OIDs) [ITUOID]. Most are decoded with names
(e.g., O, C, ST, L, CN), but some are not because the particular version of openssl
that printed the output did not understand them. The OID 1.3.6.1.4.1.311.60.2.1.3
is also called jurisdictionOfIncorporationCountryName, and 1.3.6.1.4.1.311.60.2.1.2
is called jurisdictionOfIncorporationStateOrProvinceName, both with obvious
meanings. The OID 2.5.4.15 is businessCategory (see [CABF09] for details). Note
that the CN subfield tends to be an important one when identifying subjects and
issuers for certificates used on the Internet. For this certificate, it gives the correct
matching name for the server (along with any names included in the Subject
Alternative Name (SAN) extension). Nonmatching names or URLs (e.g.,
https://digicert.com instead of https://www.digicert.com) referring to the same
server, when accessed, result in an error. Note that CN is not really the field for
holding a DNS name; SANs are intended for this purpose.

When a certificate needs to be validated, a recursive process works up the
certificate hierarchy to a root CA certificate by matching the issuer distinguished
name in one certificate with the subject name in another. In this case, the certificate
was issued by DigiCert High Assurance EV CA-1 (the issuer's CN subfield).
Assuming all certificates are current in their validity periods and are being used
in appropriate ways, some parent certificate (immediate parent, grandparent, etc.,
but usually a root CA certificate) to the Subject field of the certificate we are evaluating
must be trusted for validation to be successful.
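
The name-matching walk up the hierarchy can be sketched as follows. The sketch covers only the issuer-to-subject linkage; the signature, validity, and usage checks that a real validator must also perform are omitted. The Cert class, the build_chain function, and the root name "Example Root CA" are hypothetical, since the example output does not show which root issued the intermediate certificate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str

def build_chain(leaf: Cert, known: dict[str, Cert], trusted_roots: set[str]) -> list[str]:
    """Follow issuer -> subject links from the leaf until a trusted name is reached.

    Signature and validity checks are intentionally left out of this sketch.
    """
    chain = [leaf.subject]
    current = leaf
    while current.subject not in trusted_roots:
        parent = known.get(current.issuer)
        # Stop if the issuer is unknown or the path loops back on itself.
        if parent is None or parent.subject in chain:
            raise ValueError(f"no trusted path from {leaf.subject}")
        chain.append(parent.subject)
        current = parent
    return chain

leaf = Cert("www.digicert.com", "DigiCert High Assurance EV CA-1")
intermediate = Cert("DigiCert High Assurance EV CA-1", "Example Root CA")
root = Cert("Example Root CA", "Example Root CA")   # self-signed
known = {c.subject: c for c in (intermediate, root)}

print(build_chain(leaf, known, trusted_roots={"Example Root CA"}))
```

If no certificate on the path is in the trusted set, the walk eventually reaches a self-signed certificate or an unknown issuer and validation fails.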

The Subject Public Key Info field gives the algorithm and public key belonging 
to the entity specified in the Subject field. In this case, the public key is an RSA 
public key with a 2048-bit modulus and public exponent of 65537. The subject is 
in possession of the matching RSA private key (modulus plus private exponent) 
that is paired to the public key. If the private key is compromised, or if the public 
key needs to be changed for other reasons, the public and private keys must be 
regenerated and a new certificate issued. The old certificate is then revoked (see 
Section 18.5.2). 

Version 3 X.509 certificates may include zero or more extensions. Extensions 
are either critical or noncritical, and some are required by the Internet profile in 
[RFC5280]. If critical, an extension must be processed and found acceptable by 
the relying party's (CPS jargon) policy. Noncritical extensions are processed if 
supported but do not otherwise cause errors. In the present example, there are 
ten X.509v3 extensions. Although many extensions have been defined, those we 
shall discuss tend to fall into two informal categories. The first category includes 
information about the subject and how the certificate in question can be used. 
The second category relates to items describing the issuer and may include key 
identification and URIs indicating locations of additional information related to 
the issuing CA that is not included elsewhere. The certificate in our example is an 
end entity (not CA) certificate. CA certificates often have somewhat different extensions
or values for their extensions.
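
The distinction between critical and noncritical extensions reduces to a simple rule, sketched below. The list of extensions mirrors the critical markings visible in the example output; the set of supported extensions and the function name are assumptions for illustration, and checking whether each extension's contents are acceptable to local policy is omitted.

```python
def extensions_acceptable(extensions: list[tuple[str, bool]], supported: set[str]) -> bool:
    """Reject a certificate that carries a critical extension we cannot process."""
    for name, critical in extensions:
        if critical and name not in supported:
            return False        # unrecognized critical extension: must reject
    return True                 # unrecognized noncritical extensions are ignored

# (name, critical) pairs, following the markings in the example certificate
example = [
    ("Key Usage", True),
    ("Basic Constraints", True),
    ("Subject Alternative Name", False),
    ("Netscape Cert Type", False),
]

print(extensions_acceptable(example, {"Key Usage", "Basic Constraints"}))
print(extensions_acceptable(example, {"Key Usage"}))
```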

The Basic Constraints extension, a critical extension, indicates whether the certificate
is a CA certificate. In this case it is not, so it cannot be used for signing
other certificates. A certificate indicating that it is a CA certificate may be used in
a certificate validation chain at a location other than a leaf. This is common for root
CA certificates or for other certificate-signing certificates ("intermediate" certificates,
such as the DigiCert High Assurance EV CA-1 certificate referenced
in this example).

The Subject Key Identifier extension identifies the public key in the certificate. 
It allows different keys owned by the same subject to be differentiated. The Key 
Usage extension, a critical extension, determines the valid usage for the key. Pos¬ 
sible usages include digital signature, nonrepudiation (content commitment), key 
encipherment, data encipherment, key agreement, certificate signing, CRL signing 
(see Section 18.5.2), encipher only, and decipher only. Because server certificates of 
this kind are primarily used for identifying the two endpoints of a connection and 
encrypting a session key (see Section 18.9), the possible usages may be somewhat 
limited, as in this case. The Extended Key Usage extension, which may be critical or 
noncritical, may provide further restrictions on the key use. Possible values of this 
extension when used in the Internet profile include the following: TLS client and 
server authentication, signing of downloadable code, e-mail protection (nonrepu¬ 
diation and key agreement or encipherment), various IPsec operating modes (see 
Section 18.8), and timestamping. The SAN extension allows a single certificate to be 
used for multiple purposes (e.g., for multiple Web sites with distinct DNS names). 
This alleviates the need to have a separate certificate for each Web site, which can 
significantly reduce cost and administrative burden. In this case, the certificate can 
be used for either of the DNS names www.digicert.com or content.digicert.com 
(but not digicert.com, as mentioned before). The Netscape Cert Type extension 
is now deprecated but was used to indicate key usage to Netscape software. 
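
As a minimal sketch of how a client might compare a host name against the SAN entries of this certificate, consider the following Python fragment. The function name is an assumption, and the comparison is deliberately simplified: it performs only exact, case-insensitive matching and ignores wildcard names.

```python
def name_in_san(hostname, san_dns_names):
    """Exact, case-insensitive match of a host name against SAN DNS names."""
    wanted = hostname.rstrip(".").lower()
    return any(wanted == name.rstrip(".").lower() for name in san_dns_names)

# The two DNS names carried in the example certificate's SAN extension.
san = ["www.digicert.com", "content.digicert.com"]

for host in ("www.digicert.com", "content.digicert.com", "digicert.com"):
    print(host, name_in_san(host, san))
```

The bare name digicert.com does not match because it is not listed, even though it is a parent of both listed names.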

The remaining extensions in our example certificate relate to the manage¬ 
ment and status of the certificate and its issuing CA. The CRL Distribution Points 
(CDP) extension gives a list of URLs for finding the CA's certificate revocation list 
(CRL), a list of revoked certificates used to determine if a certificate in a valida¬ 
tion chain has been revoked (see Section 18.5.2). The Certificate Policies (CP) exten¬ 
sion includes certificate policies applicable to the certificate [RFC5280]. In this 
example, the CP extension contains a policy with two qualifiers. The Policy value of 
2.16.840.1.114412.2.1 indicates that the certificate complies with an EV policy. The 
CPS qualifier gives a pointer to the URI where the particular applicable CPS for 
the policy may be found. The User Notice qualifier may contain text intended to be 
displayed to a relying party. In this case it contains the following string: 

    Any use of this Certificate constitutes acceptance of the DigiCert EV CPS and the 
    Relying Party Agreement which limit liability and are incorporated herein by 
    reference. 

The Authority Key Identifier identifies the public key corresponding to the pri¬ 
vate key used to sign the certificate. It is useful when an issuer has multiple pri¬ 
vate keys used for generating signatures. The Authority Information Access (AIA) 
extension indicates where information may be retrieved from the CA. In this case, 
it indicates a URI used to determine if the certificate has been revoked using an 
online query protocol (see Section 18.5.2). It also indicates the list of CA issuers, 
which includes a URL containing the CA certificate responsible for signing the 
example server certificate. 

Following the extensions, the certificate contains the signature portion. It con¬ 
tains the identification of the signature algorithm (SHA-1 with RSA here), which 
must match the Signature Algorithm field we encountered earlier. In this case, the 
signature itself is a 256-byte value, corresponding to the 2048-bit modulus used 
for this use of RSA. 

18.5.2 Validating and Revoking Certificates 

We have already encountered the idea that a certificate may have to be revoked 
and possibly replaced with a freshly issued certificate. Within the IETF, [RFC5280] 
defines the use of X.509 version 3 certificates with X.509 version 2 CRLs for the 
Internet. This brings up the question of how a certificate is revoked and how this 
fact is made known to relying parties that need to know that the certificates on 
which they depend are no longer trustworthy. 

To validate a certificate, a validation or certification path must be established that 
includes a set of validated certificates, usually up to some trust anchor (e.g., root 
certificate) that is already known to the relying party. One of the key steps involves 
determining if one or more of the certificates in a chain have been revoked. If so, 
the path validation fails. We saw some of this in Section 8.5.5. 
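
The structure of path validation can be sketched as follows. This is an illustration only: the `Cert` record and `validate_path` function are hypothetical, the root name "Example Root CA" and the intermediate's serial number are invented, and signature and validity-period checks are omitted entirely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    is_ca: bool
    serial: str

def validate_path(path, trust_anchors, revoked_serials):
    """path[0] is the end-entity certificate; the last issuer must be an anchor."""
    if not path:
        return False
    for i, cert in enumerate(path):
        if cert.serial in revoked_serials:
            return False  # any revoked certificate fails the whole path
        if i > 0 and not cert.is_ca:
            return False  # only CA certificates may appear above the leaf
        if i + 1 < len(path) and cert.issuer != path[i + 1].subject:
            return False  # each certificate must be issued by the next one
    return path[-1].issuer in trust_anchors

leaf = Cert("www.digicert.com", "DigiCert High Assurance EV CA-1", False,
            "02C71FE01D70414B8BA7E29E5E5842B9")
intermediate = Cert("DigiCert High Assurance EV CA-1", "Example Root CA", True, "0A")
anchors = {"Example Root CA"}
revoked = {"0119BF8D1A24460EBE59355A11AD7B1C", "0D2ED685A9A828A21067D1826C5015A9"}

print(validate_path([leaf, intermediate], anchors, revoked))
```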

There are several reasons why a certificate may need to be revoked, such as 
when a certificate's subject (or issuer) changes affiliations or name. When a certifi¬ 
cate is revoked, it may no longer be used. The challenge is to ensure that entities 
that wish to use a certificate become aware if it has been revoked. In the Internet, 
there are two primary ways this is accomplished: CRLs and the Online Certificate 
Status Protocol (OCSP) [RFC2560]. When the CRL Distribution Point extension 
includes an HTTP or FTP URI scheme, as it does in the preceding example, the 
complete URL gives the name of a file encoded in DER format containing an X.509 
CRL. In our example, we can retrieve the CRL corresponding to the certificate 
using the following command: 


Linux% wget http://crl3.digicert.com/ev2009a.crl 


and print it out as follows: 


Linux% openssl crl -inform der -in ev2009a.crl -text 
Certificate Revocation List (CRL): 
        Version 2 (0x1) 
        Signature Algorithm: sha1WithRSAEncryption 
        Issuer: /C=US/O=DigiCert Inc/OU=www.digicert.com/ 
                CN=DigiCert High Assurance EV CA-1 
        Last Update: Jan 2 06:20:13 2011 GMT 
        Next Update: Jan 9 06:20:00 2011 GMT 
        CRL extensions: 
            X509v3 Authority Key Identifier: 
                keyid:4C:58:CB:25:F0:41:4F:52:F4: 
                      28:C8:81:43:9B:A6:A8:A0:E6:92:E5 
            X509v3 CRL Number: 
                732 
Revoked Certificates: 
    Serial Number: 0119BF8D1A24460EBE59355A11AD7B1C 
        Revocation Date: Jul 29 19:25:40 2009 GMT 
        CRL entry extensions: 
            X509v3 CRL Reason Code: 
                Unspecified 
    Serial Number: 0D2ED685A9A828A21067D1826C5015A9 
        Revocation Date: Dec 17 17:18:40 2010 GMT 
        CRL entry extensions: 
            X509v3 CRL Reason Code: 
                Superseded 
    Signature Algorithm: sha1WithRSAEncryption 
        d4:a3:50:07:1b:b8:17:ff:e2:83:3d:b9:6a:3e:22:8d:e4:22: 
        40:12:0b:cf:26:d9:16:99:b1:96:5a:86:ea:3e:8a:3f:f9:39: 
        c7:e0:92:f6:66:72:7e:a4:f0:fd:16:d4:ec:2f:10:35:ea:2d: 
        45:06:19:4b 
-----BEGIN X509 CRL----- 
MIIHeDCCBmACAQEwDQYJKoZIhvcNAQEFBQAwaTELMAkGAlUEBhMCWMxFTATBgNV 
BAoTDERpZ21DZXJ0IEluYzEZMBcGAlUECxMQd3d3LmRpZ21jZXJ0LmNvbTEoMCYG 
hzcRf+ITVZ7 6LtHdzWDDPFuj PYgPzMnkbGgGVsve9Gd4NcQioz0YOCDvaLezg069 
EYmMaYk9zXFSaBVdEZ5Tgekrj OfFnsfgkvZmcn6k8POW10wvEDXqLUUGGUs = 
-----END X509 CRL----- 

Here we can see the format of an X.509 v2 CRL. The format is very similar 
to that of a certificate, and the entire message is signed by a CA as certificates 
are. This is useful because CRLs can be distributed like certificates: using otherwise 
untrusted communication channels and servers. In comparison with a 
certificate, the validity period is replaced by a list of the previous and next CRL 
updates. There is no subject and no public key but instead a list of serial numbers 
for revoked certificates plus the time and reason for revocation. There may also be 
CRL extensions that are unique to CRLs. In this example, the Authority Key Identi¬ 
fier extension gives a number identifying the key used by the CA in signing the 
CRL. The CRL Number extension gives the sequence number of the CRL. Other 
values are given in [RFC5280]. 
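
To make the use of these fields concrete, the following Python sketch checks a serial number against the contents of the CRL printed above. The dictionary layout and the `crl_status` function are assumptions made for illustration, and verification of the CRL's signature is omitted.

```python
from datetime import datetime, timezone

# Values copied from the CRL printed above.
crl = {
    "last_update": datetime(2011, 1, 2, 6, 20, 13, tzinfo=timezone.utc),
    "next_update": datetime(2011, 1, 9, 6, 20, 0, tzinfo=timezone.utc),
    "revoked": {
        "0119BF8D1A24460EBE59355A11AD7B1C": "Unspecified",
        "0D2ED685A9A828A21067D1826C5015A9": "Superseded",
    },
}

def crl_status(crl, serial_hex, now):
    """Return 'stale', 'revoked (<reason>)', or 'not listed'."""
    if not (crl["last_update"] <= now <= crl["next_update"]):
        return "stale"  # outside the update window; fetch a fresher CRL
    reason = crl["revoked"].get(serial_hex.upper())
    return f"revoked ({reason})" if reason else "not listed"

now = datetime(2011, 1, 3, tzinfo=timezone.utc)
print(crl_status(crl, "0D2ED685A9A828A21067D1826C5015A9", now))
print(crl_status(crl, "02C71FE01D70414B8BA7E29E5E5842B9", now))
```

The second serial number belongs to the example server certificate, which does not appear in the list.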

The other primary method for determining if a certificate has been revoked 
is OCSP. OCSP is an application-level request/response protocol usually operated 
over HTTP (i.e., using the HTTP protocol with TCP/IP on TCP port 80). An 
OCSP request includes information identifying a particular certificate, plus some 
optional extensions. A response indicates whether the certificate is not revoked, 
unknown, or revoked. An error may be returned if the request cannot be parsed 
or otherwise acted upon. The key used for signing the OCSP response need not 
necessarily match the key used to sign the original certificate. This is possible if 
the issuer included a Key Usage extension indicating an alternate OCSP provider. 

To see an OCSP request/response exchange, we can execute the following 
commands once we have obtained the appropriate Class 1 certificate in the file 
DigiCertHighAssuranceEVCA-1.pem (not shown). In the following example, 
some lines are wrapped for clarity: 


Linux% CERT=DigiCertHighAssuranceEVCA-1.pem 
Linux% openssl ocsp -issuer $CERT -cert digicert.pem \ 
        -url http://ocsp.digicert.com -VAfile $CERT -no_nonce -text 
OCSP Request Data: 
    Version: 1 (0x0) 
    Requestor List: 
        Certificate ID: 
          Hash Algorithm: sha1 
          Issuer Name Hash: B8A299F09D061DD5C1588F76CC89FF57092B94DD 
          Issuer Key Hash: 4C58CB25F0414F52F428C881439BA6A8A0E692E5 
          Serial Number: 02C71FE01D70414B8BA7E29E5E5842B9 
OCSP Response Data: 
    OCSP Response Status: successful (0x0) 
    Response Type: Basic OCSP Response 
    Version: 1 (0x0) 
    Responder Id: 4C58CB25F0414F52F428C881439BA6A8A0E692E5 
    Produced At: Jan 2 08:03:24 2011 GMT 
    Responses: 
    Certificate ID: 
      Hash Algorithm: sha1 
      Issuer Name Hash: B8A299F09D061DD5C1588F76CC89FF57092B94DD 
      Issuer Key Hash: 4C58CB25F0414F52F428C881439BA6A8A0E692E5 
      Serial Number: 02C71FE01D70414B8BA7E29E5E5842B9 
    Cert Status: good 
    This Update: Jan 2 08:03:24 2011 GMT 
    Next Update: Jan 9 08:18:24 2011 GMT 

Response verify OK 
digicert.pem: good 
        This Update: Jan 2 08:03:24 2011 GMT 
        Next Update: Jan 9 08:18:24 2011 GMT 

As we can see, the OCSP transaction has indicated that the certificate is good. 
The request included the identification of a hash algorithm (SHA-1), a hash of the 
issuer name, a number identifying the issuer's key (the same as the Key ID exten¬ 
sion in the certificate), plus the certificate's serial number. The responder, identi¬ 
fied by the responder ID, identifies itself and signs the response. The response 
includes the hashes and numbers from the request, as well as the certificate status 
of "good" (i.e., not revoked). OCSP relieves the client of having 
to download the latest CRL to check but still requires the client to form and verify 
the entire certification path. In some cases, this can be a considerable burden for 
the client. 
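
The matching of request and response fields in this exchange can be sketched in Python. The `CertID` record and `check_single_response` function are hypothetical names; the sketch assumes the response signature has already been verified and covers only the comparison of identifiers and the interpretation of the status.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CertID:
    hash_algorithm: str
    issuer_name_hash: str
    issuer_key_hash: str
    serial_number: str

def check_single_response(requested, response_id, cert_status):
    """Return True only for a 'good' status about the certificate we asked for."""
    if response_id != requested:
        raise ValueError("response does not match the requested certificate ID")
    if cert_status not in ("good", "revoked", "unknown"):
        raise ValueError("unexpected certificate status")
    return cert_status == "good"

# Values taken from the exchange shown above.
request_id = CertID("sha1",
                    "B8A299F09D061DD5C1588F76CC89FF57092B94DD",
                    "4C58CB25F0414F52F428C881439BA6A8A0E692E5",
                    "02C71FE01D70414B8BA7E29E5E5842B9")
response_id = CertID("sha1",
                     "B8A299F09D061DD5C1588F76CC89FF57092B94DD",
                     "4C58CB25F0414F52F428C881439BA6A8A0E692E5",
                     "02C71FE01D70414B8BA7E29E5E5842B9")
print(check_single_response(request_id, response_id, "good"))
```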

To help address the burden of certificate chain formation and validation 
imposed on client systems, the Server-Based Certificate Validation Protocol (SCVP) 
has been defined in [RFC5055] but is not widely used. With SCVP, formulation of 
a certification path (called delegated path discovery or DPD) and, optionally, valida¬ 
tion (called delegated path validation or DPV) of it can be offloaded to a server. Vali¬ 
dation is offloaded only to a trusted server. Not only does this provide a method 
to reduce the load on clients, but it also offers a method for helping to ensure that 
a common validation policy is used consistently throughout an enterprise. 


18.5.3 Attribute Certificates 

In addition to public key certificates (PKCs) used to bind names to public keys, 
X.509 defines another type of certificate called an attribute certificate (AC). ACs are 
similar in structure to PKCs but lack a public key. They are used to indicate other 
information, including authorization information that may have a lifetime differ¬ 
ent from (e.g., shorter than) a corresponding PKC [RFC5755]. ACs contain other 
structures similar to PKCs, including extensions and AC policies. 


18.6 TCP/IP Security Protocols and Layering 

We have seen that cryptography provides a basis for building communication sys¬ 
tems that have a number of desirable security properties. Protocols involving cryp¬ 
tography can (and do) exist at a number of different layers in the protocol stack. 
Consistent with our understanding of the OSI reference model we discussed in 
Chapter 1, we now see that encryption, and thus various forms of strong security, 
can be supported at essentially every layer. 

As we might expect, security services at the link layer protect information 
only as it flows across a single communication hop, security at the network layer 
protects information flowing between hosts, security at the transport layer pro¬ 
tects process-to-process communication, and security at the application layer pro¬ 
tects information manipulated by applications. It is also possible to protect the 
data manipulated by applications independently of the communication layers 
(e.g., files can be encrypted and sent as e-mail attachments). Figure 18-4 illustrates 
the most common security protocols used in conjunction with TCP/IP. 


Layer Number   Layer Name    Examples 
7              Application   DNSSEC, DKIM, EAP, Diameter, RADIUS, SSH, Kerberos, IPsec (IKE) 
4              Transport     TLS, DTLS, PANA 
3              Network       IPsec (ESP) 
2              Link          802.1X (EAPoL), 802.1AE (MACSec), 802.11i/WPA2, EAP 

Figure 18-4 Security protocols exist at essentially every OSI stack layer, plus some "in-between" 
layers. Selecting the appropriate protocols for the threats to be addressed requires 
attention to detail. 


In Figure 18-4, we can see that there are many security protocols, and the ones 
we care about at any given time depend on what scope of functionality we require. 
We shall discuss most of the protocols in Figure 18-4 in what follows, with 
particular emphasis on IPsec (machine-to-machine security at layer 3), TLS (Transport 
Layer Security designed for supporting applications), and DNSSEC. TLS and 
IPsec are the most prevalent, as TLS is used with all secure Web communications 
(HTTPS) and IPsec is used with most network-layer security, including VPNs. 
DNSSEC, which secures the DNS (see Chapter 11), is being introduced slowly, but 
the perceived demand is significant. Security of the DNS will help to limit DNS 
hijacking attacks, in which client systems are redirected to bogus DNS servers that 
supply incorrect information. Two of the fairly popular protocols we do not discuss 
in detail are Kerberos [RFC4120], a trusted third-party authentication system 
now used in Windows enterprise environments, and SSH [RFC4251], the 
secure shell remote login and tunneling protocol used most often with UNIX-like 
systems. These protocols tend to be used among computers running particular 
operating systems, although this is by no means required. We have elected to use 
the detailed protocol descriptions in this chapter to cover the protocols that we 
believe will apply to an even broader Internet audience over time. 

Although virtually every modern networking technology has some associated 
security approach, we shall move up the layers in the OSI stack from the bottom, 
starting with the link layer. We have already seen (see Chapter 3) that some of 
the link-layer protocols have their own security mechanisms (e.g., 802.11-2007 has 
WPA2 included in the specification, based on the earlier 802.11i specification). We 
shall be especially concerned with protocols that apply to more than one specific 
type of link layer network. 


18.7 Network Access Control: 802.1X, 802.1AE, EAP, and PANA 

Network Access Control (NAC) refers to methods used to authorize or deny network 
communications to particular systems or users. Defined by the IEEE, the 802.1X 
Port-Based Network Access Control (PNAC) standard is commonly used with TCP/IP 
networks to support LAN security in enterprises, for both wired and wireless 
networks. The purpose of PNAC is to provide access to a network (e.g., intranet or 
the Internet) only if a system and/or its user has been authenticated based on the 
system's network attachment point. Used in conjunction with the IETF standard 
Extensible Authentication Protocol (EAP) [RFC3748], 802.1X is sometimes called 
EAP over LAN (EAPoL), although the 802.1X standard covers more than just the 
EAPoL packet format. 

The most common variant of 802.1X is based on the standard as published in 
2004; however, [802.1X-2010] includes compatibility with 802.1AE (IEEE standard 
LAN encryption called MACSec) and 802.1AR (X.509 certificates for secure device 
identities). It also includes a somewhat complex MACSec key agreement protocol 
called MKA that we do not discuss further. In 802.1X, a system being authenticated 
implements a function known as a supplicant. The supplicant interacts with 
an authenticator and a backend authentication server to perform authentication and 
gain network access. VLANs (see Chapter 3) are often used in helping to enforce 
the access control decisions made by 802.1X. 

EAP can be used with multiple link-layer technologies and supports multiple 
methods for implementing authentication, authorization, and accounting (AAA). EAP 
does not perform encryption itself, so it must be used in conjunction with some 
other cryptographically strong protocol to be secure. When used with link-layer 
encryption such as WPA2 on wireless networks or 802.1AE on wired networks, 
802.1X is relatively secure. EAP uses the same concepts of supplicant and authentication 
server as does 802.1X, but with different terminology (EAP uses the terms 
peer, authenticator, and AAA server although even in EAP-related literature backend 
authentication server is sometimes used). An example setup is shown in Figure 18-5. 



Figure 18-5 EAP, supported by 802.11i and 802.1X, allows for a peer (supplicant) to be authenticated by an 
authenticator that is separate from an AAA server. The authenticator can operate in "pass-through" 
mode in which it does little more than forward EAP packets. It can also participate 
more directly in the EAP protocol. The pass-through mode allows authenticators to avoid having 
to implement a large number of authentication methods. 


In this figure we see a hypothetical enterprise network including wired and 
wireless peers, a protected network that includes the AAA server and another 
intranet server on a particular VLAN, and an unauthenticated or "remediation" 
VLAN. The authenticator's job is to interact with unauthenticated peers and the 
AAA server (via AAA protocols such as RADIUS [RFC2865][RFC3162] or Diameter 
[RFC3588]) to determine if each peer should be granted access to the protected 
network. If so, this can be accomplished in several ways. The most common approach is 
to make a VLAN mapping adjustment so that the authenticated peer is assigned to 
the protected VLAN or to another VLAN that provides connectivity to the protected 
VLAN using a router (layer 3). An authenticator may use VLAN trunking (IEEE 
802.1AX link aggregation; see Chapter 3) and may be capable of assigning VLAN 
tags based on port number or forwarding VLAN tagged frames sent by the peer. 


Note 

In some EAP deployments, the authenticator is used without an AAA server, and 
the authenticator must evaluate the peer’s credentials on its own. When refer¬ 
ring to the location where authentication is determined, the term EAP server is 
used in the EAP literature. Generally, the EAP server is the AAA server (backend 
authentication server) when the authenticator acts in pass-through mode and is 
the authenticator otherwise. 

In 802.1X, the protocol between the supplicant and the authenticator is divided 
into a lower and upper sublayer. The lower layer is called the port access control 
protocol (PACP). The higher layer is ordinarily some variant of EAP. For use with 
802.1AR, the variant is called EAP-TLS [RFC5216]. PACP uses EAPoL frames for 
communication, even if EAP authentication is not used (e.g., when MKA is used). 
EAPoL frames use an Ethertype field value of 0x888E (see Chapter 3). 
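
As a small illustration, the following Python sketch recognizes an EAPoL frame by its Ethertype. It assumes an untagged Ethernet frame, where the Ethertype occupies bytes 12 and 13; the addresses and payload in the example are made up, and the function name is hypothetical.

```python
import struct

ETHERTYPE_EAPOL = 0x888E

def is_eapol(frame: bytes) -> bool:
    """True if an untagged Ethernet frame carries EAPoL."""
    if len(frame) < 14:
        return False  # too short to hold a full Ethernet header
    (ethertype,) = struct.unpack("!H", frame[12:14])
    return ethertype == ETHERTYPE_EAPOL

# Made-up destination and source addresses, followed by the Ethertype.
dst = bytes.fromhex("020000000001")
src = bytes.fromhex("020000000002")
eapol_frame = dst + src + struct.pack("!H", 0x888E) + b"\x00" * 4
ipv4_frame = dst + src + struct.pack("!H", 0x0800) + b"\x00" * 4

print(is_eapol(eapol_frame), is_eapol(ipv4_frame))
```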

Moving to IETF standards, EAP is not a single protocol but rather a framework 
for achieving authentication using a combination of other protocols, some of 
which we discuss throughout the chapter, including TLS and IKEv2. The baseline 
EAP packet format is shown in Figure 18-6. 


0                  15 16                 31
+---------+----------+--------------------+
|  Code   |Identifier|       Length       |
+---------+----------+--------------------+
|                Data ...                 |
+-----------------------------------------+

Figure 18-6 The EAP header includes a Code field for demultiplexing packet types (Request, 
Response, Success, Failure, Initiate, Finish). The Identifier helps match requests to 
responses. For request and response messages, the first data byte is a Type field. 


The EAP packet format is simple. In Figure 18-6, the Code field contains one 
of six EAP packet types: Request (1), Response (2), Success (3), Failure (4), Initiate 
(5), and Finish (6). The last two are defined by the EAP Re-authentication Protocol 
(see Section 18.7.2); the official field values are maintained by the IANA [IEAP]. 
The Identifier field contains a number chosen by the sender and is used to match 
requests with replies. The Length field gives the number of bytes in the EAP mes¬ 
sage, including the Code, Identifier, and Length fields. Requests and responses are 
used to perform identification and authentication with the peer, ultimately result¬ 
ing in a Success or Failure indication. The protocol is capable of carrying an infor¬ 
mative message so that human users can be given some instructions about what 
to do if their system is unable to authenticate. It is a reliable protocol that runs on 
a lower-layer protocol that is assumed to preserve order but is not assumed to be 
reliable. EAP itself does not implement other features such as congestion or flow 
control but may use protocols that do. 
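
The header layout just described can be expressed with Python's struct module. This is a sketch rather than a complete implementation: the function names are assumptions, and it relies on the Code and Identifier fields each occupying one byte and the Length field occupying two bytes in network byte order.

```python
import struct

CODES = {1: "Request", 2: "Response", 3: "Success",
         4: "Failure", 5: "Initiate", 6: "Finish"}

def build_eap(code, identifier, data=b""):
    # Length counts the Code, Identifier, and Length fields plus the data.
    length = 4 + len(data)
    return struct.pack("!BBH", code, identifier, length) + data

def parse_eap(packet):
    code, identifier, length = struct.unpack("!BBH", packet[:4])
    if length < 4 or length > len(packet):
        raise ValueError("bad EAP Length field")
    # Bytes beyond Length are treated as lower-layer padding and dropped.
    return CODES.get(code, "Unknown"), identifier, packet[4:length]

# A Request with Identifier 7 whose single data byte is the Type field (1).
packet = build_eap(1, 7, b"\x01")
print(packet.hex())
print(parse_eap(packet))
```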

The typical EAP exchange starts with the authenticator sending a Request 
message to the peer. The peer responds with a Response message. Both messages 
use the same format, as shown in Figure 18-6. An overview of the exchange is 
shown in Figure 18-7. 

[Figure: message exchange among the Peer (Supplicant), the Authenticator, and the AAA Server 
(Backend Authentication Server); the middle of the exchange depends on the authentication 
method (e.g., MD5-Challenge, EAP-TLS).] 

Figure 18-7 The baseline EAP messages carry authentication material between the peer and the 
authenticator. In many deployments, the authenticator is a relatively simple device that 
acts in a "pass-through" mode. In such cases, most of the protocol processing takes 
place on the peer and AAA server. IETF standard AAA-specific protocols such as 
RADIUS or Diameter may be used to encapsulate EAP messages carried between the 
AAA server and authenticator. 


The primary purpose of the Request and Response messages is to exchange 
whatever information is required to allow an authentication method to succeed. 
Numerous methods are defined within [RFC3748], and several are defined in 
other standards. The particular method being used is encoded in the Type field of 
Request and Response messages using values of 4 or greater. Other special Type 
field values include Identity (1), Notification (2), Nak ("Legacy Nak") (3), and an 
Expanded Type extension (254). The Identity type is used by an authenticator 
to ask the peer for its identifying information and provide a method for the peer to 
respond. The Notification type is used to display a message or notification to a 
user or log file (not for errors, but for notifications). When a peer does not support 
a method requested by the authenticator, it replies with a negative ACK (either a 
Legacy Nak or an Expanded Nak). Expanded Naks include a vector of implemented 
authentication methods not present in Legacy Naks. 
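
A peer's handling of the Type field can be sketched as follows. The function names are hypothetical, and the method numbers used in the example (4 for MD5-Challenge, 13 for EAP-TLS) come from the IANA EAP registry rather than from the text above. Only the Type values named in the text are distinguished.

```python
SPECIAL_TYPES = {1: "Identity", 2: "Notification", 3: "Legacy Nak", 254: "Expanded Type"}

def describe_type(type_value):
    """Classify a Type value using only the categories named in the text."""
    if type_value in SPECIAL_TYPES:
        return SPECIAL_TYPES[type_value]
    if type_value >= 4:
        return "authentication method"
    return "invalid"

def peer_reaction(request_type, supported_methods):
    """Decide how a peer reacts to the Type carried in a Request."""
    kind = describe_type(request_type)
    if kind in ("Identity", "Notification"):
        return "answer"
    if kind == "authentication method" and request_type in supported_methods:
        return "run method"
    return "nak"  # unsupported method: reply with a Nak

supported = {13}  # this hypothetical peer implements only EAP-TLS
print(peer_reaction(1, supported), peer_reaction(13, supported), peer_reaction(4, supported))
```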

EAP is a layered architecture that supports its own multiplexing and demultiplexing. 
Conceptually, it consists of four layers: the lower layer (for which there are 
multiple protocols), EAP layer, EAP peer/authenticator layer, and EAP methods 
layer (for which there are many methods). The lower layer is responsible for 
transporting EAP frames in order. Perhaps ironically, some of the protocols used to 
transport EAP are actually higher-layer protocols, many of which we have 
discussed already. Examples of EAP "lower-layer" protocols include 802.1X, 802.11 
(802.11i) (see Chapter 3), UDP with L2TP (see Chapter 3), UDP with IKEv2 (see 
Section 18.8.1), and TCP (see Chapters 12-17). Figure 18-8 shows how the layers are 
implemented in conjunction with a pass-through authenticator. A pass-through 
server would be the opposite but is not supported by RADIUS or Diameter. 


[Figure: EAP stack showing EAP Method 1, EAP Method 2, the EAP Peer Layer, and the EAP Layer.] 

Figure 18-8 The EAP stack and implementation model. In the pass-through mode, the peer and 
AAA server are responsible for implementing the EAP authentication methods. The 
authenticator need only implement EAP message processing, the authenticator processing, 
and enough of an AAA protocol (e.g., RADIUS, Diameter) to exchange information 
with the AAA server. 


In the "EAP stack" depicted in Figure 18-8, the EAP layer implements reliability 
and duplicate elimination. It also performs demultiplexing based on the Code 
value in EAP packets. The peer/authenticator layer is responsible for implementing 
the peer and/or authenticator protocol messages, based on demultiplexing of the 
Code field. The EAP methods layer consists of all the specific methods to be used for 
authentication, including any required protocol operations to handle large mes¬ 
sages. This is necessary because the rest of the EAP protocol does not implement 
fragmentation and some methods may require large messages (e.g., containing 
certificates or certificate chains). 
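
The demultiplexing and duplicate-elimination roles of the EAP layer can be sketched in Python. This is a simplified reading of the behavior in [RFC3748], not a full state machine: the class and function names are assumptions, Initiate and Finish are left unhandled, and a duplicate is detected only by comparing the Identifier with that of the previous Request.

```python
def deliver(code):
    """Choose the layer that receives a packet, based on its Code field."""
    if code in (1, 3, 4):  # Request, Success, Failure
        return "peer"
    if code == 2:          # Response
        return "authenticator"
    return "unhandled"     # other codes are outside this sketch

class PeerEapLayer:
    """Replays the stored Response when a Request is retransmitted."""

    def __init__(self):
        self.last_id = None
        self.last_response = None

    def on_request(self, identifier, make_response):
        if identifier == self.last_id:
            return self.last_response, True  # duplicate: no new processing
        self.last_id = identifier
        self.last_response = make_response()
        return self.last_response, False

calls = []
layer = PeerEapLayer()
first = layer.on_request(7, lambda: calls.append("built") or b"resp-7")
again = layer.on_request(7, lambda: calls.append("built") or b"resp-7")
print(first, again, len(calls))
```

The method layer is invoked once even though the Request arrived twice, which is the point of eliminating duplicates below it.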

18.7.1 EAP Methods and Key Derivation 

Given its architecture, many EAP authentication and encapsulation methods are 
available for use (more than 50). Some are specified by IETF standards, and others 
have evolved separately (e.g., from Cisco or Microsoft). Some of the more common 
methods include TTLS [RFC5281], TLS [RFC5216], FAST [RFC4851], LEAP (Cisco 
proprietary), PEAP (EAP over TLS, Cisco proprietary), IKEv2 (experimental) 
[RFC5106], and MD5. Of these, only MD5 is specified in [RFC3748], but it is no 
longer recommended for use. Unfortunately, the complexity does not end when 
specifying one of these methods alone. Within each method there are sometimes 
different options for cryptographic suites or identity verification. With PEAP, for 
example, some versions of Microsoft Windows support MSCHAPv2 and TLS. 

The reasons for having so many options are partly historical. As security and 
operational experience have evolved over time, some methods were found to be 
too insecure or insufficiently flexible. Some authentication methods require an 
operating PKI that can provide client certificates (e.g., EAP-TLS), while others (e.g., 
PEAP, TTLS) do not require such infrastructure. Older protocols (e.g., LEAP) were 
designed at a time when other standards such as 802.11 (incorporating 802.11i) 
were not yet mature. Consequently, depending on the particular environment, 
various combinations of smart cards or tokens, passwords, or certificates may be 
required to use EAP. 

The purpose of the EAP methods is to establish authentication, and possibly 
authorization for network access. In some cases (e.g., EAP-TLS), the methods pro¬ 
vide bidirectional authentication, whereby each end acts as both an authenticator 
and a peer. The type of authentication provided by a method is often a conse¬ 
quence of the cryptographic primitives it employs. 

Some methods provide more than authentication. Those that provide key derivation 
are able to agree upon and export keys in a key hierarchy [RFC5247] and 
must provide for mutual authentication between the EAP peer and EAP server. 
The master session key (MSK, also called AAA-key) is used in deriving other keys 
using a KDF, either at an EAP peer or authenticator. MSKs are at least 64 bytes in 
length and are typically used to derive transient session keys (TSKs) that are used to 
enforce access control between a peer and an authenticator, often at lower layers. 
Extended MSKs (EMSKs) are also provided along with MSKs but are made available 
only to the EAP server or peer, not to pass-through authenticators, and are 
used in deriving root keys [RFC5295]. Root keys are keys associated with particular 
usages or domains. A usage-specific root key (USRK) is a key derived from an EMSK 
in the context of a particular usage. A domain-specific root key (DSRK) is a key 
derived from an EMSK for use in a particular domain (i.e., collection of systems). 
Child keys derived from a DSRK are known as domain-specific usage-specific root 
keys (DSUSRKs). 
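
The idea of deriving several independent root keys from one EMSK can be sketched with a labeled KDF. The following Python fragment is loosely modeled on the approach in [RFC5295] (an HMAC-SHA-256 expansion over a label, a zero byte, optional data, and the output length), but it is an illustration and should not be assumed to interoperate with any implementation. The labels, the all-zero EMSK, and the 64-byte minimum applied to the EMSK are assumptions.

```python
import hashlib
import hmac
import struct

def prf_plus(key, seed, length):
    """Iterated HMAC-SHA-256 expansion in the style of IKEv2's prf+."""
    out, block, counter = b"", b"", 1
    while len(out) < length:
        block = hmac.new(key, block + seed + bytes([counter]), hashlib.sha256).digest()
        out += block
        counter += 1
    return out[:length]

def derive_root_key(emsk, label, length=64, optional_data=b""):
    """Derive a usage- or domain-specific root key from an EMSK (sketch)."""
    if len(emsk) < 64:
        raise ValueError("EMSK is assumed to be at least 64 bytes")
    seed = label.encode("ascii") + b"\x00" + optional_data + struct.pack("!H", length)
    return prf_plus(emsk, seed, length)

emsk = bytes(64)  # placeholder EMSK; a real one is exported by the EAP method
usrk = derive_root_key(emsk, "example-usage@example.org")
dsrk = derive_root_key(emsk, "example-domain@example.org", optional_data=b"example.net")
print(len(usrk), usrk != dsrk)
```

Because the label is part of the KDF input, keys for different usages or domains are cryptographically separated even though they share the same EMSK.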

During an EAP exchange, multiple peer and server identities may be used, 
and a session identifier is allocated. On completion of an EAP-based authentication 
where key derivation is supported, the MSK, EMSK, peer identifier(s), server 
identifier(s), and a session ID are made available to lower layers. (A now-deprecated 
initialization vector might also be provided.) Keys generally have an associated 
lifetime (8 hours is recommended), after which EAP re-authentication is 
required. For an in-depth discussion of EAP's key management framework and an 
accompanying detailed security analysis, please see [RFC5247]. 

18.7.2 The EAP Re-authentication Protocol (ERP) 

In cases where EAP authentication has completed successfully, it is often desirable 
to reduce latency if a subsequent authentication exchange is required (e.g., a mobile 
node moves from one access point to another). The EAP Re-authentication Protocol 
(ERP) [RFC5296] provides the ability to do this independent of any particular EAP 
method. EAP peers and servers that support ERP are called ER peers and servers, 
respectively. ERP uses a re-authentication root key (rRK) derived from a DSRK (or 
the EMSK, but [RFC5295] suggests avoiding this) along with a re-authentication 
integrity key (rIK) derived from the rRK used to prove knowledge of the rRK. 

ERP operates in a single round-trip time, which is consistent with its goal 
of reducing re-authentication latency. ERP begins with a full conventional EAP 
exchange, assumed to be in the "home" domain. The MSK generated is distributed 
to the authenticator and peer as usual. However, the rIK and rRK values are 
also determined at this time and shared only between the peer and EAP server. 
These values can be used in the home domain, along with rMSKs generated for 
each authenticator. When the ER peer moves to a different domain, different values 
(DS-rIK and DS-rRK, which are DSUSRKs) are used. The domain of the ER 
server is contained in a TLV area in ERP messages, allowing peers to determine 
the domain of the server with which they are communicating. Details of the 
protocol are given in [RFC5296]. 

18.7.3 Protocol for Carrying Authentication for Network Access (PANA) 

While combinations of EAP, 802.1X, and PPP have all been used to support authen¬ 
tication of the client (and network, in some cases), they are not entirely link-inde¬ 
pendent. EAP tends to be implemented for particular links, 802.1X applies to IEEE 
802 networks, and PPP uses a point-to-point network model. To address this con¬ 
cern, the Protocol for Carrying Authentication for Network Access (PANA) has been 
defined in [RFC5191], [RFC5193], and [RFC6345] based on requirements set out in 
[RFC4058] and [RFC4016]. It acts as an EAP lower layer, meaning it acts as a "carrier" 
for EAP information. It uses UDP/IP (port 716) and is therefore applicable to 
more than a single type of link, and it is not limited to a point-to-point network 
model. In effect, PANA allows EAP authentication methods to be used on any 
link-layer technology for determining network access. 

The PANA framework includes three main functional entities: the PANA Client
(PaC), PANA Authentication Agent (PAA), and the PANA Relay Element (PRE).
Normal usage also involves an Authentication Server (AS) and Enforcement Point
(EP). The AS may be a conventional AAA server accessed using access protocols
such as RADIUS or Diameter. The PAA is responsible for conveying authentication
material from a PaC to the AS, and for configuration of the EP when network
access is approved or revoked. Some of these entities may be colocated. The PaC
and associated EAP peer are always colocated, as are the EAP authenticator and
PAA. A PRE can be used to relay communications between a PaC and PAA when
direct communication is not otherwise possible.

The PANA protocol consists of a set of request/response messages including
an extensible set of attribute-value pairs managed by the IANA [IPANA]. The primary
payloads are EAP messages, sent in UDP/IP datagrams as part of a PANA
session. There are four phases in a PANA session: authentication/authorization,
access, re-authentication, and termination. The re-authentication phase is really a
portion of the access phase wherein the session lifetime is extended by re-executing
EAP-based authentication. The termination phase is entered either explicitly
or as the result of the session timing out (either because of lifetime exhaustion
or failure of liveness detection). PANA sessions are identified by a 32-bit session
identifier included in each PANA message.

PANA also provides a form of reliable transport protocol. Each message
contains a 32-bit sequence number. The sender keeps track of the next sequence
number to send, and receivers keep track of the next expected sequence number.
Answers contain the same sequence number as the corresponding request. Initial
sequence numbers are randomly selected by the sender of the message (i.e.,
PaC or PAA). PANA also implements time-based retransmission. PANA is a weak
transport protocol: it operates in a stop-and-wait fashion, does not use an adaptive
retransmission timer, and cannot perform repacketization. It does, however,
perform exponential backoff on its retransmission timer when faced with multiple
packet losses.
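
The stop-and-wait discipline and timer backoff just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the 1-second initial timeout, and the 30-second cap are hypothetical and are not taken from the PANA specification, which defines its own timer parameters.

```python
import random


class StopAndWaitSender:
    """Illustrative stop-and-wait sender with exponential timer backoff."""

    def __init__(self, initial_rto=1.0, max_rto=30.0, rng=None):
        rng = rng or random.Random()
        # The initial sequence number is chosen at random by the sender.
        self.next_seq = rng.getrandbits(32)
        self.initial_rto = initial_rto   # assumed value, for illustration only
        self.max_rto = max_rto           # assumed cap, for illustration only
        self.rto = initial_rto
        self.outstanding = None          # sequence number awaiting an answer

    def send_request(self):
        """Send a new request; only one may be outstanding at a time."""
        if self.outstanding is not None:
            raise RuntimeError("previous request is still unanswered")
        self.outstanding = self.next_seq
        self.next_seq = (self.next_seq + 1) % 2**32
        self.rto = self.initial_rto
        return self.outstanding

    def on_timeout(self):
        """Retransmit the same request and double the timer, up to the cap."""
        self.rto = min(self.rto * 2, self.max_rto)
        return self.outstanding, self.rto

    def on_answer(self, seq):
        """Accept an answer only if it echoes the outstanding sequence number."""
        if seq != self.outstanding:
            return False
        self.outstanding = None
        return True
```

A sender built this way never has more than one request in flight, which is the sense in which the text calls PANA a weak transport protocol.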


18.8 Layer 3 IP Security (IPsec) 

IPsec is an architecture and collection of standards that provide data source
authentication, integrity, confidentiality, and access control at the network layer
for IPv4 and IPv6 [RFC4301], including Mobile IPv6 [RFC4877]. It also provides
a way to exchange cryptographic keys between two communicating parties, a
recommended set of cryptographic suites, and a method for signaling the use of
compression. Each communicating party may be an individual host or a security
gateway (SG) that provides a boundary between a protected and an unprotected
portion of a network. Thus, IPsec can be used in applications such as remote
access to a corporate LAN (forming a VPN), to interconnect different portions of
an enterprise securely across the open Internet, or to secure the communications
of hosts or routers acting as hosts when exchanging routing information. When
choosing a security approach for newly developed protocols, IPsec is sometimes
selected [RFC5406].

Figure 18-9 indicates the types of deployments that can be accomplished using
IPsec. A host implementation of IPsec may be integrated within the IP stack itself
or may act as a driver sitting "below" the rest of the network stack (called the
"Bump in the Stack" or BITS implementation). Alternatively, it may reside inside
an inline SG, which is sometimes called the "Bump in the Wire" or BITW implementation
approach. For BITW implementations, both host and SG functionality
is generally required, as the device typically needs to be managed remotely. This
is similar to the reasons we see applications and transport protocols implemented
in routers that would otherwise be pure layer 3 devices (see Chapter 1). IPsec can
support multicast communications, but we focus first on the simpler and more
common unicast case.



Figure 18-9 IPsec is applicable to securing host-to-host communications, host-to-gateway communications, 
and gateway-to-gateway communications. It also supports multicast distribution and mobility. 


The operation of IPsec can be divided into the establishment phase, where 
key material is exchanged and a security association (SA) is built, followed by the 
data exchange phase, where different types of encapsulation schemes, called the 
Authentication Header (AH) and Encapsulating Security Payload (ESP), may be used 
in different modes such as tunnel mode or transport mode to protect the flow of IP 
datagrams. Each of these IPsec components uses a cryptographic suite, and IPsec 
is designed to support a wide range of suites. A complete IPsec implementation 
includes the SA establishment protocol, AH (optionally), ESP, and a collection of 
appropriate cryptographic suites, configuration information, and setup tools. An 
overview that summarizes the evolution and current specifications for all IPsec 
components is given in [RFC6071].

Although an IPsec implementation may be present in a system (it is required 
to be present for IPv6 implementations), IPsec operates only selectively on certain 
packets based on policies set by administrators. The policies are contained in a 
security policy database (SPD), logically resident with each IPsec implementation. 
IPsec also requires two additional databases called the security association database 
(SAD) and peer authorization database (PAD). These are consulted when determining
how packets are to be handled, as illustrated in Figure 18-10.

Figure 18-10 In a security gateway, IPsec packet processing takes place at layer 3 in a logical entity separating
a protected and an unprotected network. The security policy database dictates the disposition of
packets: bypass, discard, or protect. Protection generally involves applying or validating integrity
protection or encryption. An administrator configures the SPD to achieve desired security
goals.

Taking the (somewhat simplified) SG of Figure 18-10 as an example, particular
fields of an arriving packet (traffic selectors) are inspected to determine whether
the arriving packet is using IPsec and has a preexisting SA. If so, processing is
relatively simple and usually involves applying either ESP or AH, as described
in Sections 18.8.2 and 18.8.3. If not, the SPD is used to determine what type of SA
should be established, if any, and the SAD is populated to contain information on
the new SA. If a new SA needs to be established, the simplest way is using some
automated key establishment protocol. Although IPsec mandates the support of
manual keying, where keys are simply typed in by hand, this method does not
scale well and is error-prone. Therefore, it is expected that normally a key establishment
protocol is used in establishing SAs. For IPsec, the most recent version of
this protocol is what we explore next.
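
The bypass/discard/protect decision can be pictured as an ordered policy lookup. The sketch below is a deliberate simplification: SpdEntry and spd_lookup are hypothetical names, selectors are reduced to source and destination prefixes, and discarding unmatched traffic is an assumption of the sketch. A real SPD also matches on ports and protocol and is consulted together with the SAD.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass
class SpdEntry:
    """One simplified policy rule: address prefixes plus a disposition."""
    src_net: str
    dst_net: str
    action: str  # "bypass", "discard", or "protect"


def spd_lookup(spd, src, dst):
    """Return the action of the first matching entry (first match wins)."""
    for entry in spd:
        if (ip_address(src) in ip_network(entry.src_net)
                and ip_address(dst) in ip_network(entry.dst_net)):
            return entry.action
    # Assumption of this sketch: traffic that matches no policy is dropped.
    return "discard"


# Example policy: protect traffic to one subnet, bypass IPsec for one host.
EXAMPLE_SPD = [
    SpdEntry("10.0.0.0/8", "192.0.2.0/24", "protect"),
    SpdEntry("0.0.0.0/0", "198.51.100.53/32", "bypass"),
]
```

With this policy, spd_lookup(EXAMPLE_SPD, "10.1.2.3", "192.0.2.7") yields "protect", and only that outcome would lead to consulting or creating an SA.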

18.8.1 Internet Key Exchange (IKEv2) Protocol 

The first step in using IPsec is to establish an SA. An SA is a simplex (one-direction)
authenticated association established between two communicating parties, or
between a sender and multiple receivers if IPsec is supporting multicast. Most
frequently, communication is bidirectional between two parties, so a pair of SAs is
required to use IPsec effectively. A special protocol called the Internet Key Exchange
(IKE) is used to accomplish this task automatically. The current version of the protocol
is called IKEv2 [RFC5996]. We will refer to it simply as IKE. Note that IKE is
one of the more complicated pieces of IPsec, so once we understand it, the rest is
comparatively straightforward. Note, however, that we will discuss only the major
points of how IKE operates as a protocol. For particular details, such as the myriad
cryptographic suites and configuration parameters supported, the reader should
consult [RFC5996] directly.

To establish an SA, IKE begins with a simple request/response message pair
that includes a request to establish the following parameters: an encryption algorithm,
an integrity protection algorithm, a Diffie-Hellman group, and a PRF that
gives a random-appearing output given any input bit string. In IKE, a PRF is used
for generation of session keys. IKE first establishes an SA for itself (called an IKE_SA)
and can subsequently establish SAs for either AH or ESP (called CHILD_SAs).
IKE is also capable of negotiating the use of IP Payload Compression (IPComp)
[RFC3173] with each CHILD_SA, because applying compression at other layers
after performing encryption is ineffective. We discuss the details of AH and ESP
in Sections 18.8.2 and 18.8.3.

IKE operates using pairs of messages called exchanges that are sent between
an initiator and a responder. The first two exchanges, called IKE_SA_INIT and
IKE_AUTH, establish an IKE_SA and a single CHILD_SA. Subsequently,
CREATE_CHILD_SA exchanges, used to establish additional CHILD_SAs, and
INFORMATIONAL exchanges, used to initiate changes in or gather status information about
an SA, may occur. In most cases, a single IKE_SA_INIT and IKE_AUTH exchange
(a total of four messages) is sufficient. Messages used in an exchange contain payloads
identified by type numbers that identify the type of information carried in
each payload. Multiple payloads per message are common, and some long messages
may require IP fragmentation.

IKE messages are sent encapsulated in UDP using port number 500 or 4500. 
However, because IKE traffic may pass through a NAT where the port number is 
rewritten, an IKE receiver should be prepared to receive traffic originating from 
any port. Port 4500 is reserved for UDP-encapsulated ESP and IKE [RFC3948]. IKE
messages appearing on port 4500 are required to have their initial 4 data bytes set 
to 0 (the "non-ESP marker") to differentiate them from other (i.e., ESP or WESP) 
messages. 
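
The role of the non-ESP marker can be illustrated with a small demultiplexer for datagrams arriving on UDP port 4500. The function name is hypothetical. The sketch relies only on the rule stated above, plus the assumption that an ESP packet begins with its 32-bit SPI and that this SPI is nonzero, so the two cases cannot be confused.

```python
NON_ESP_MARKER = b"\x00\x00\x00\x00"


def demux_port_4500(udp_payload):
    """Classify a UDP/4500 payload as IKE or UDP-encapsulated ESP.

    Returns a (kind, data) pair. For IKE the 4-byte marker is stripped so
    that data starts at the IKE header; for ESP the payload is unchanged.
    """
    if len(udp_payload) < 4:
        raise ValueError("payload too short to classify")
    if udp_payload[:4] == NON_ESP_MARKER:
        return "ike", udp_payload[4:]
    # Otherwise the first four bytes are taken to be an ESP SPI.
    return "esp", udp_payload
```

Traffic on port 500 carries no marker, so a receiver would apply this check only to datagrams addressed to port 4500.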

IKE initiators perform timer-based retransmissions when IKE messages
appear to have been lost. Responders perform retransmissions only when triggered
by an incoming request. An exponentially increasing retransmission timer
is used for retransmissions, but the total number of retransmissions is left unspecified.
Both initiators and responders keep track of their last transmitted messages
and corresponding sequence numbers. Sequence numbers are used to match
requests with responses, and to identify message retransmissions. This makes
IKE a window-based protocol with a maximum window size given by a responder
that is initialized when an SA is first set up but can be increased later. The maximum
window size limits the total number of outstanding requests.

18.8.1.1 IKEv2 Message Formats

IKE messages contain a header followed by zero or more IKE payloads. The header 
structure is shown in Figure 18-11. 


[Figure 18-11 layout, 32 bits per row (bits 0-31):
  IKE_SA Initiator's SPI (64 bits)
  IKE_SA Responder's SPI (64 bits)
  Next Payload (8 bits) | Maj Version (4 bits) | Min Version (4 bits) | Exchange Type (8 bits) | Flags (8 bits)
  Message ID (32 bits)
  Length (32 bits)
  (Payloads follow the IKE header)
 The two SPI fields are used as a connection identifier.
 Flags word: X(0), X(0), R (Response), V (Version: higher major version number supported),
 I (Initiator), X(0) (3 bits).]

Figure 18-11 The IKEv2 header. All IKE messages contain a header followed by zero or more payloads. IKE
uses 64-bit SPI values. The Exchange Type gives the purpose of the exchange and the payloads
that may be expected in the message. The Flags field indicates whether the message was sent
from an initiator or a responder. The Message ID associates requests with responses and is used
for detecting replay attacks.


In the headers of IKE messages, as shown in Figure 18-11, the Security Parameter
Index (SPI) is a 64-bit number that identifies a particular IKE_SA (other IPsec
protocols use a 32-bit SPI value). Both the initiator and the responder have an
SA for their peer, so each provides the SPI it is using, and this pair of values,
combined with the IP addresses of the endpoints, can be used to form an effective
connection identifier. The Next Payload field is discussed later in this section.
The Major Version and Minor Version fields are set to 2 and 0, respectively, for
this version of IKE. The major version number is changed when interoperability
cannot be maintained between versions. The Exchange Type field gives the type
of exchange of which the message is part: IKE_SA_INIT (34), IKE_AUTH (35),
CREATE_CHILD_SA (36), INFORMATIONAL (37), and IKE_SESSION_RESUME
(38; see [RFC5723]). Other values are reserved; the range 240-255 is reserved for
private use. Three bit fields are defined for the Flags field (bits are labeled right
to left, starting from 0): I (Initiator, bit 3), V (Version, bit 4), and R (Response, bit 5).
The I bit field is set by the original initiator and cleared by the recipient for return
messages. The V bit field indicates that the sender supports a higher major version
number of the protocol than is currently being used. The R bit field indicates that
the message is a response to a previous message using the same message ID.
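
As a concrete illustration of the header fields just described, the following Python sketch unpacks the fixed 28-byte IKE header. The function and dictionary key names are hypothetical; the field widths, exchange type numbers, and flag bit positions are those given in the text.

```python
import struct

# SPIi (8), SPIr (8), Next Payload (1), Version (1), Exchange Type (1),
# Flags (1), Message ID (4), Length (4); network byte order, 28 bytes.
IKE_HEADER = struct.Struct("!8s8sBBBBII")

EXCHANGE_TYPES = {
    34: "IKE_SA_INIT",
    35: "IKE_AUTH",
    36: "CREATE_CHILD_SA",
    37: "INFORMATIONAL",
    38: "IKE_SESSION_RESUME",
}

FLAG_INITIATOR = 1 << 3  # I bit
FLAG_VERSION = 1 << 4    # V bit
FLAG_RESPONSE = 1 << 5   # R bit


def parse_ike_header(data):
    """Decode the fixed IKE header at the start of data into a dict."""
    if len(data) < IKE_HEADER.size:
        raise ValueError("truncated IKE header")
    (spi_i, spi_r, next_payload, version,
     exchange, flags, message_id, length) = IKE_HEADER.unpack_from(data)
    return {
        "initiator_spi": spi_i,
        "responder_spi": spi_r,
        "next_payload": next_payload,
        "major_version": version >> 4,
        "minor_version": version & 0x0F,
        # Unknown exchange types are returned as their raw number.
        "exchange_type": EXCHANGE_TYPES.get(exchange, exchange),
        "initiator": bool(flags & FLAG_INITIATOR),
        "higher_version": bool(flags & FLAG_VERSION),
        "response": bool(flags & FLAG_RESPONSE),
        "message_id": message_id,
        "length": length,
    }
```

The Length value covers the header plus all payloads, so a caller would normally compare it with the number of bytes actually received before processing any payloads.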

The Message ID field in IKE acts somewhat like the Sequence Number field in 
TCP (see Figure 12-3 in Chapter 12), except the message ID starts with 0 for the 
initiator and 1 for the responder. The field is incremented by 1 for each subsequent 
transmission, and responses use the same message ID as the requests. The I and 
R bit fields differentiate requests from responses. Message IDs are remembered 
when sent or received. Doing so allows each end to perform replay detection. Old 
message IDs are not processed. Wrapping of the Message ID field (possible, but not 
likely with 4 billion IKE messages) is handled by reinitiating the IKE_SA_INIT 
exchange. 

The other fields (Next Payload and Length) help describe what the IKE message
contains. Each message contains zero or more payloads, and each payload has its
own particular structure. The Length field gives the size (in bytes) of the header
plus all payloads in the message. The Next Payload field gives the type of the following
payload. At present, 16 nontrivial types are defined (value 0 indicates no
next payload), as shown in Table 18-2. The official current list can be found in
[IKEPARAMS], which contains all standardized field values for IKEv2.


Table 18-2 IKEv2 payload types. A value of 0 indicates no next payload. 


Value  Notation  Purpose
-----  --------  ------------------------------------------------------
33     SA        Security association
34     KE        Key exchange
35     IDi       Identification (initiator)
36     IDr       Identification (responder)
37     CERT      Certificate
38     CERTREQ   Certificate request (indicates trust anchors)
39     AUTH      Authentication
40     Ni, Nr    Nonces (initiator, responder)
41     N         Notify
42     D         Delete
43     V         Vendor ID
44     TSi       Traffic selector (initiator)
45     TSr       Traffic selector (responder)
46     SK{}      Encrypted and authenticated (contains other payloads)
47     CP        Configuration
48     EAP       Extensible authentication (EAP)

The ranges 1-32 and 49-255 are reserved; the range 128-255 is reserved for
private use. Each IKE payload begins with an IKE generic payload header, shown in
Figure 18-12.

Figure 18-12 A "generic" IKEv2 payload header. Each payload begins with a header of this form.


The generic payload header is fixed at 32 bits, and the Next Payload and Payload 
Length fields provide for a "chain" of variable-size payloads (up to 65,535 bytes 
each, including the 4-byte payload header) to be present in a single IKE message. 
Each payload type has its own set of special headers. The C (critical) bit field indicates
that the current payload (not the one identified by the Next Payload field) is
deemed "critical" for a successful IKE exchange. Receivers of critical payloads that 
do not understand the type code (provided in the previous payload's Next Payload 
field or in the IKE header's Next Payload field) must abort the IKE exchange. Note 
that this capability provides the ability to create new payload types that may not 
be understood by all implementations. 
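
The payload "chain" can be followed with a short loop. The sketch below is illustrative only: the function name is hypothetical, and it assumes the generic header layout of [RFC5996] (an 8-bit Next Payload, then the C bit as the most significant bit of the second byte, then a 16-bit Payload Length), since the figure itself is not reproduced here.

```python
import struct


def walk_payloads(first_type, body):
    """Yield (payload_type, critical, data) for each payload in body.

    first_type is the Next Payload value from the IKE header; body is the
    part of the message that follows the IKE header.
    """
    payload_type, offset = first_type, 0
    while payload_type != 0:  # 0 means "no next payload"
        next_type, flags, length = struct.unpack_from("!BBH", body, offset)
        # Payload Length includes the 4-byte generic header itself.
        if length < 4 or offset + length > len(body):
            raise ValueError("bad payload length")
        critical = bool(flags & 0x80)
        yield payload_type, critical, body[offset + 4:offset + length]
        # The type of a payload is named by its predecessor's Next Payload.
        payload_type, offset = next_type, offset + length
```

A receiver following the rule in the text would abort the exchange if it met a payload type it did not understand with the critical flag set, and would skip an unrecognized payload otherwise.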

18.8.1.2 The IKE_SA_INIT Exchange 

To get a better idea of how IKE operates, we will start by describing the
IKE_SA_INIT exchange. It is the first of two exchanges, IKE_SA_INIT and IKE_AUTH,
constituting the "initial exchanges" of IKE shown in Figure 18-13. The initial
exchanges were formerly known as Phase 1 in earlier versions of IKE. Other
exchanges (CREATE_CHILD_SA and INFORMATIONAL) may be initiated by
either party only after the initial exchanges have completed, and they are always
secured (encrypted and integrity-protected) based on the parameters established
using the first two exchanges.

As shown in Figure 18-13, IKE_SA_INIT negotiates the choice of cryptographic
suite, exchanges nonces, and performs a DH key agreement. It may also
include additional information, depending on the particular implementation and
deployment scenario. It begins when the initiator sends an IKE message containing
its set of supported cryptographic suites, DH information, and nonce using
three payloads (SA, KE, and Ni). Details of each payload type are given in Section
3 of [RFC5996], and we discuss some of them in Section 18.8.1.3; note that in some
implementations additional payloads are also included. A lack of response to this
message triggers retransmissions at the initiator.

Upon receiving the first message, the responder learns that the initiator is requesting
an IKE transaction, along with the initiator's supported cryptographic suites and
configuration parameters. The responder selects an acceptable cryptographic suite
and expresses this in the SAr1 payload (see Section 18.8.1.3). It also provides its
portion of the DH key agreement parameters in KEr, its nonce in Nr, and an optional
request for the initiator's certificate in the CERTREQ payload. CERTREQ payloads
include an indication of CAs the responder finds acceptable for validating certificates
that may be used in subsequent exchanges (i.e., it indicates the responder's trust
anchors). A message containing the responder's IKE header and all of these payloads
is then sent in response to the initiator, completing the IKE_SA_INIT exchange. In
some implementations, extra payloads (e.g., Notify and Configuration payloads; see
Section 18.8.1.5) are also included. To better understand how IKE_SA_INIT operates,
we shall begin by discussing its most important payloads: SA, KE, Ni, and Nr.

Figure 18-13 The IKE_SA_INIT and IKE_AUTH exchange involves payloads used to establish the
first two security associations (IKE_SA and one CHILD_SA). Certificates and certificate
request payloads (with trust anchors) may also be included, as may Notification
and Configuration payloads (not shown).

18.8.1.3 Security Association (SA) Payloads and Proposals

SA payloads contain an SPI value and a set of proposals (often one). Proposals are
built using proposal structures that are somewhat complex. Each proposal structure
is numbered and contains an IPsec protocol ID. A protocol ID indicates one
of the following IPsec protocols: IKE, AH, or ESP (see Sections 18.8.2 and 18.8.3).
Multiple proposal structures using the same proposal number are considered
to be part of the same proposal (an "AND" of the specified protocols). Proposal
structures with different proposal numbers are considered different proposals (an
"OR" of the specified protocols).

Each proposal/protocol structure contains one or more transform structures
that describe algorithms to be used with the specified protocols. Typically, AH
has a single transform (integrity check algorithm), ESP has two (integrity check
and encryption algorithms), and IKE has four (DH group number, PRF, integrity
check, and encryption algorithms). Combined encryption/integrity algorithms
(e.g., authenticated encryption algorithms) are expressed solely as encryption
algorithms with no separate integrity protection specification. A special extended
sequence number "transform," which is really just a Boolean value, indicates
whether sequence numbers used with the SA (i.e., for AH or ESP) should be computed
using 32 or 64 bits.

If there are multiple transforms of the same type, the proposal is the union of
the transforms (i.e., any are acceptable). If there are multiple transforms with different
types, the proposal is the intersection. An individual transform may have
zero or more attributes. These are necessary when a transform can be used in more
than one way (e.g., a transform capable of processing keys of differing lengths
would have an associated attribute with the particular key length to be used for
the proposal). Most transforms do not require attributes, but the relatively common
AES encryption transform does.
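
The union/intersection rule for transforms can be made concrete with a small selection routine. This is a sketch rather than the negotiation algorithm of any particular implementation: the function name and the dictionary representation of a proposal are assumptions, transform attributes such as key length are ignored, and picking the first mutually supported transform of each type is just one possible local policy.

```python
def select_transforms(proposal, supported):
    """Pick one transform per type from a single proposal, or return None.

    proposal maps a transform type to the list of transforms offered for it
    (any one of them is acceptable). supported maps a transform type to the
    set of transforms the responder implements. Every type present in the
    proposal must be satisfied for the proposal to be acceptable.
    """
    chosen = {}
    for transform_type, offered in proposal.items():
        usable = supported.get(transform_type, ())
        match = next((t for t in offered if t in usable), None)
        if match is None:
            return None  # no common transform of this type
        chosen[transform_type] = match
    return chosen
```

A responder facing several proposals (the "OR" case described earlier) could apply this function to each in turn and accept the first one that does not return None.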

18.8.1.4 Key Exchange (KE) and Nonce (Ni, Nr) Payloads 

In addition to SA payloads, IKE_SA_INIT messages include a KE (Key 
Exchange) and Nonce payload (written as Ni, Nr, or sometimes No). The KE payload
contains the DH group number and key exchange data representing the
public numbers used in forming an ephemeral Diffie-Hellman key (initial shared 
secret). The DH group number gives the group in which the public value was 
computed. The Nonce payload contains a recently generated nonce between 16 
and 256 bytes in length. It is used in generating key material to ensure freshness 
and protect against replay attacks. 

Once the DH exchange completes, each side can compute its SKEYSEED value, 
which is used for all subsequent key generation associated with the IKE_SA (unless 
a key-generating EAP method is used for this purpose; see Section 18.8.1.9), a total 
of seven secret values: SK_d, SK_ai, SK_ar, SK_ei, SK_er, SK_pi, and SK_pr. These 
values are computed as follows: 

    SKEYSEED = prf(Ni | Nr, g^ir)

    {SK_d | SK_ai | SK_ar | SK_ei | SK_er | SK_pi | SK_pr} =
        prf+(SKEYSEED, Ni | Nr | SPIi | SPIr)

Here, | is the concatenation operator. The cascading PRF function prf+(K,S) =
T1 | T2 | ..., where T1 = prf(K, S | 0x01), T2 = prf(K, T1 | S | 0x02), T3 = prf(K, T2 | S | 0x03),
T4 = prf(K, T3 | S | 0x04), ... . The value g^ir is the shared secret established during
the DH exchange. Ni and Nr are nonces (stripped of any payload headers). Note
that each direction of each SA uses different keys, which explains why so many
keys are required. The SK_d key is used for deriving keys for CHILD_SAs. The
SK_a and SK_e keys are for authentication and encryption, respectively. The SK_p
keys are used in generating AUTH payloads during the IKE_AUTH exchange.
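
The derivation above can be followed step by step in code. The Python sketch below instantiates prf with HMAC-SHA1 purely for illustration (the actual PRF is whatever was negotiated), and the function names and default key lengths are assumptions of the example rather than values mandated by the text.

```python
import hashlib
import hmac


def prf(key, data):
    """The negotiated PRF; HMAC-SHA1 is assumed here for illustration."""
    return hmac.new(key, data, hashlib.sha1).digest()


def prf_plus(key, seed, length):
    """Return the first length bytes of T1 | T2 | T3 | ...

    T1 = prf(K, S | 0x01) and Tn = prf(K, T(n-1) | S | n) for n > 1.
    """
    output, block, counter = b"", b"", 1
    while len(output) < length:
        if counter > 255:
            raise ValueError("prf+ counter is a single octet")
        block = prf(key, block + seed + bytes([counter]))
        output += block
        counter += 1
    return output[:length]


def derive_ike_sa_keys(ni, nr, dh_secret, spi_i, spi_r,
                       prf_len=20, integ_len=20, encr_len=16):
    """Derive the seven IKE_SA keys; the key lengths are example values."""
    skeyseed = prf(ni + nr, dh_secret)
    layout = [("SK_d", prf_len), ("SK_ai", integ_len), ("SK_ar", integ_len),
              ("SK_ei", encr_len), ("SK_er", encr_len),
              ("SK_pi", prf_len), ("SK_pr", prf_len)]
    stream = prf_plus(skeyseed, ni + nr + spi_i + spi_r,
                      sum(size for _, size in layout))
    keys, offset = {}, 0
    for name, size in layout:
        # Keys are taken from the prf+ output stream in the listed order.
        keys[name] = stream[offset:offset + size]
        offset += size
    return keys
```

Because both sides know Ni, Nr, the two SPIs, and the DH shared secret, each can run the same computation independently and arrive at identical keys without ever sending them.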

18.8.1.5 Notification (N) and Configuration (CP) Payloads

The N payload is a Notification or Notify payload. Although this type of payload
is not shown in Figure 18-13, we shall see it used in the examples later. It can be
used for conveying error messages and indications of various processing capabilities
with most of the IKE exchange types. It contains a variable-length SPI field
and a 16-bit field to indicate the notification type [IKEPARAMS]. Values below
8192 are used for standard errors, and values above 16383 are used for status indicators.
For example, when requesting the creation of a transport mode SA instead
of the default tunnel mode, a Notify payload containing the USE_TRANSPORT_MODE
value (16391) is used. If IP compression [RFC3173] is supported, this fact
can be indicated by the IPCOMP_SUPPORTED value (16387). If Robust Header
Compression (ROHC) [RFC5857] is supported, this can be indicated using the
ROHC_SUPPORTED value (16416), which also includes ROHC parameters used
to establish a so-called ROHCoIPsec SA. A desire to use the "wrapped ESP" mode
(see Section 18.8.3.2) is indicated using the USE_WESP_MODE value (16415). Notify
payloads may contain a variable-length data portion whose content depends on
the notification type.
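
The split of the 16-bit notification type space described above can be captured in a few lines. The function name and return format are hypothetical. The text does not characterize values from 8192 through 16383, so this sketch simply reports them as unclassified rather than guessing at their meaning.

```python
# Status types named in the text above (values as listed in [IKEPARAMS]).
STATUS_NAMES = {
    16387: "IPCOMP_SUPPORTED",
    16391: "USE_TRANSPORT_MODE",
    16415: "USE_WESP_MODE",
    16416: "ROHC_SUPPORTED",
}


def describe_notify_type(value):
    """Return (kind, name) for a 16-bit Notify message type."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("notification type is a 16-bit field")
    if value < 8192:
        kind = "error"
    elif value > 16383:
        kind = "status"
    else:
        kind = "unclassified"  # range not described in the text
    # name is None for any type not listed in STATUS_NAMES.
    return kind, STATUS_NAMES.get(value)
```

A receiver could use the kind to decide whether a Notify payload signals a failed exchange or merely advertises a capability.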

A CP or Configuration payload also contains additional information like a
Notify payload but is used primarily for initial system configuration. For example,
obtaining information that might ordinarily be conveyed using DHCP (see Chapter
6) can be carried over IKE using a CP. Configuration payloads are of the following
major types: CFG_REQUEST, CFG_REPLY, CFG_SET, and CFG_ACK. CPs
use attribute-value (ATV) pairs that contain a variable-length associated data area.
Some 20 ATV pairs are defined [IKEPARAMS]. Most involve methods to learn
about IPv4 or IPv6 addresses, subnet masks, or DNS server addresses. IPv6 configuration
requires special attention because of the way IPv6 ordinarily employs
ICMPv6 for stateless autoconfiguration and Neighbor Discovery (see Chapter 8).
An experimental specification [RFC5739] explores how IKEv2 can be used in configuring
an IPv6 node across an IPsec association in a VPN configuration.

18.8.1.6 Algorithm Selection and Application 

IKE divides the set of transforms forming a cryptographic suite into four types:
encryption (type 1, used with IKE and ESP), PRF (type 2, used with IKE), integrity
protection (type 3, used with IKE and AH and optional in ESP), and DH group
(type 4, used with IKE and optional in AH and ESP). Although IKE is capable of
negotiating which particular cryptographic suite is to be used for each direction of
an SA, support for a baseline set of algorithms (transforms) is deemed mandatory
for any implementation. In addition, several algorithms have been chosen as recommended,
with the strong possibility that they will be mandatory in the future.
These algorithms are provided in [RFC4307] (see Table 18-3).

The IANA also keeps an official registry of values [IKEPARAMS], and
although the list here includes the mandatory algorithms at the time of writing,
many other algorithms, groups, and techniques have been proposed and published,
including options for ECC-based digital signatures (see [RFC4754]).

Table 18-3 Mandatory-to-implement algorithms for use with IKEv2, grouped by type number


Purpose                            Name                  Number  Status       Original Defining RFC/Reference
---------------------------------  --------------------  ------  -----------  -------------------------------
IKE Transform Type 1 (encryption)  ENCR_3DES             3       Required     [RFC2451]
                                   ENCR_NULL             11      Optional     [RFC2410]
                                   ENCR_AES_CBC          12      Recommended  [RFC3602]
                                   ENCR_AES_CTR          13      Recommended  [RFC3686]
IKE Transform Type 2 (for PRFs)    PRF_HMAC_MD5          1       Optional     [RFC2104]
                                   PRF_HMAC_SHA1         2       Required     [RFC2104]
                                   PRF_AES128_CBC        4       Recommended  [RFC4434]
IKE Transform Type 3 (integrity)   AUTH_HMAC_MD5_96      1       Optional     [RFC2403]
                                   AUTH_HMAC_SHA1_96     2       Required     [RFC2404]
                                   AUTH_AES_XCBC_96      5       Recommended  [RFC3566]
IKE Transform Type 4 (DH groups)   1024 MODP (Group 2)   2       Required     [RFC2409]
                                   2048 MODP (Group 14)  14      Recommended  [RFC3526]


18.8.1.7 The IKE_AUTH Exchange 

As mentioned earlier, the SKEYSEED value is used to derive encryption and
authentication keys that are in turn used to secure payloads during the
IKE_AUTH exchange. These keys are called SK_e and SK_a, respectively. The notation
SK{P1, P2, ..., PN} indicates that payloads P1, ..., PN are encrypted and
integrity-protected using these keys. The primary purpose of the IKE_AUTH exchange is to
provide identity validation for each peer. It also exchanges sufficient information
to establish the first CHILD_SA.

To begin the IKE_AUTH exchange, the initiator sends the payload SK{IDi,
AUTH, SAi2, TSi, TSr}. Given the proper decryption key, it provides the initiator's
identity, authentication information validating the initiator's identity, another SA
payload for the first CHILD_SA called SAi2, and a pair of traffic selectors (payloads
TSi and TSr, discussed in Section 18.8.1.8). The initiator may also include its certificate
in a CERT payload, a certificate request in a CERTREQ payload that identifies
its trust anchors, and identification of the responder in the IDr payload. Sending
the responder's identity is useful in the case where the responder has multiple
identities associated with the same IP address and needs to ensure that the proper
SA is set up. Several different identity types are supported for ID payloads, including
IP address, FQDN, e-mail address, and distinguished name (to be used with
X.509 certificates). The various types are maintained in the IKEv2 Identification
Payload ID Types registry [IKEPARAMS].

The final message of the exchange includes the responder's identity (IDr),
authentication material to prove the responder's identity (AUTH), the other SA
constituting the CHILD_SA (SAr2), and a set of traffic selectors (TSi and TSr),
which may be subsets of the original TSi and TSr values. All payloads in the
IKE_AUTH exchange are encrypted and integrity-protected. A certificate payload
(CERT) containing one or more certificates may also be sent at this point. If so,
any public key required to validate the AUTH payload appears first in the certificate
list. The specific contents vary depending on the cryptographic suite selected.
During the exchanges, both sides must check all applicable signatures in order to
be safe from compromise, including MITM attacks.

18.8.1.8 Traffic Selectors and TS Payloads 

Traffic selectors indicate the fields and corresponding values of an IP datagram 
that cause it to be "selected" for IPsec processing. They are used in combination 
with an IPsec SPD to determine whether the containing datagram should be protected
using IPsec. As mentioned previously, datagrams that are not protected are
either bypassed or dropped by IPsec processing. 
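
How a datagram comes to be "selected" can be shown with a minimal matcher. The TrafficSelector class below is a hypothetical model, not the on-the-wire TS payload format: it assumes a selector consists of an address range, a port range, and a protocol number, and it treats protocol 0 as "any protocol" as an assumption of the sketch.

```python
from dataclasses import dataclass
from ipaddress import ip_address


@dataclass(frozen=True)
class TrafficSelector:
    first_addr: str
    last_addr: str
    first_port: int = 0
    last_port: int = 65535
    protocol: int = 0  # 0 is treated here as "any protocol"

    def matches(self, addr, port, protocol):
        """Return True if the given datagram fields fall inside this selector."""
        in_addr = (ip_address(self.first_addr)
                   <= ip_address(addr)
                   <= ip_address(self.last_addr))
        in_port = self.first_port <= port <= self.last_port
        in_proto = self.protocol in (0, protocol)
        return in_addr and in_port and in_proto
```

For example, TrafficSelector("192.0.2.0", "192.0.2.255", 80, 80, 6) matches TCP (protocol 6) traffic to port 80 on any address in that range, and nothing else.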

The contents of a TS payload may include IPv4 or IPv6 address ranges, port
number ranges, and an IPv4 protocol ID or IPv6 header value. Ranges are sometimes
denoted with wildcard notation. For example, the notation 192.0.2.* or
192.0.2.0/24 would represent the range 192.0.2.0-192.0.2.255. Traffic selectors can
be used to help implement policies such as which cryptographic suite is required
to establish an SA to a particular host or port range. Most of these details are handled
in the management interface to the SPD. During an IKE_AUTH exchange,
each party specifies a TSi and TSr payload containing TS values. When one range
is smaller than another, the smaller range is selected for use in a process called
"narrowing."

18.8.1.9 EAP and IKE

Although IKE includes its own authentication methods (see Section 2.15 of
[RFC5996]), it can also make use of EAP (see Sections 2.16 and 3.16 of [RFC5996]).
With EAP, a wide array of authentication methods can be used beyond the relatively
limited set of pre-shared keys or public key certificates otherwise required
by IKE. Indeed, these limited sets of options for keying are one reason for the
relatively limited success of IPsec more generally.

A desire to use EAP is indicated by omitting the first AUTH payload from 
the IKE_AUTH exchange in message 3 (Figure 18-1). By including the IDi payload
but no AUTH payload, the initiator asserts an identity but does not prove it.
If EAP is acceptable, the responder returns an EAP payload and defers sending 
the SAr2, TSi, and TSr payloads until the EAP-based authentication is complete. 
This happens once the initiator has finally sent an EAP-acceptable AUTH payload 
that can be verified by the responder after one or more EAP payloads have been 
exchanged. 

One issue regarding EAP with IKE involves a possible inefficiency due to 
double authentication. In particular, older EAP methods provided only one direction
of authentication (peer to authenticator), so IKE requires certificate-based
authentication to perform authentication in the other direction. Recognizing that
deploying the necessary key infrastructure is sometimes difficult, and that newer
EAP methods support mutual authentication and key derivation, [RFC5998] provides
a way to use only EAP for authentication. When the initiator sends an
EAP_ONLY_AUTHENTICATION Notification payload, the responder is able to
suppress sending the AUTH and CERT payloads carried in message 4 (in Figure
18-1). In this case, subsequent AUTH payloads use the key generated by EAP
instead of SK_pi and SK_pr. 

Performing EAP-only authentication relies on EAP methods that are suffi¬ 
ciently secure so as to obviate the need for IKE authentication. These are called 
safe EAP methods. To be safe, an EAP method must provide mutual authentica¬ 
tion, be capable of generating keys, and be resistant to dictionary attacks. Some 13 
methods are given in [RFC5998], including EAP-TLS, EAP-FAST, and EAP-TTLS, 
that are believed to be safe. 

18.8.1.10 Better-than-Nothing Security (BTNS) 

A relatively recent development with IKE and IPsec is called better-than-nothing 
security (BTNS, pronounced "buttons"). BTNS aims to address some of the usabil¬ 
ity and ease of deployment issues with IPsec, especially the need to establish a 
PKI or other deployed authentication system [RFC5387] to use certificates. Tech¬ 
nically, BTNS is essentially unauthenticated IPsec [RFC5386], and it can be sup¬ 
ported when IKE is used to establish an SA. With BTNS, public keys are used, 
but their containing certificates are not checked against a chain or root certificate. 
Consequently, an SA can ensure that the same entity is communicating over time 
but cannot ensure that any particular, validated entity established the SA. This 
form of authentication is called continuity of association and is weaker than the data 
origin authentication present in ordinary IPsec. BTNS makes no other substantive 
changes to IPsec; the formats of IKE, AH, and ESP messages remain the same. 

18.8.1.11 The CREATE_CHILD_SA Exchange 

The CREATE_CHILD_SA exchange is used to create CHILD_SAs for ESP or AH, 
or to rekey existing SAs (either IKE_SAs or CHILD_SAs) once the initial exchanges 
have completed. It uses a single exchange of packets and may be initiated by either 
side of the IKE_SA established during the initial exchanges. There are two vari¬ 
ants, depending on whether a CHILD_SA or IKE_SA is being modified. Figure 
18-14 shows the variants, where the initiator is the entity initiating the CREATE_ 
CHILD_SA exchange and not necessarily the original initiator of the IKE_SA. 

In Figure 18-14, the first exchange depicts a CREATE_CHILD_SA used to cre¬ 
ate a new CHILD_SA or rekey an existing one. Rekeying is indicated by the pres¬ 
ence of an N(REKEY_SA) Notification payload sent by the initiator. To complete 
the rekey operation, a new SA is first created, and the old one is subsequently 
deleted (see the next section). The new SA and traffic selector (TS) information 
allows most of the connection parameters to be altered. If desired, new DH values
can also be exchanged at this point using KE payloads. This provides better
forward secrecy for the new SA. Rekeying an IKE_SA uses a similar exchange,
except the KE payloads are required and the TS payloads are not used, as shown
in the second part of Figure 18-14.

Figure 18-14 The CREATE_CHILD_SA exchange can be used to create or rekey a CHILD_SA, or to rekey an
IKE_SA. A Notification payload is used when modifying a CHILD_SA to indicate the SPI of the
SA to modify.

18.8.1.12 The INFORMATIONAL Exchange 

The INFORMATIONAL exchange is used for conveying status and error information,
usually using Notify (N) payloads. It is also used for deleting SAs using a
Delete (D) payload and therefore constitutes one portion of the SA rekeying
procedure. The exchange is shown in Figure 18-15.

An INFORMATIONAL exchange can take place only after successful completion
of the initial exchanges. It includes an optional set of notifications, Delete (D)
payloads that specify SAs to delete by SPI value, and Configuration (CP) payloads.
Some response is always required for any message received from an initiator,
even if it is an empty IKE message (i.e., contains only a header). Otherwise, the
initiator would retransmit its message unnecessarily. In unusual cases,
INFORMATIONAL messages may be sent outside the context of an INFORMATIONAL
exchange, usually to signal the receipt of an IPsec message containing an
unrecognized SPI value or unsupported IKE major version number.

Figure 18-15 The INFORMATIONAL exchange is used to convey status information and delete SAs.
It makes use of Notification (N), Delete (D), and Configuration (CP) payloads.


18.8.1.13 Mobile IKE (MOBIKE) 

Once the IKE_SA has been established, it is ordinarily used until no longer 
required. However, for environments where IP addresses may change because of
mobility or interface failure, a variant of IKE called MOBIKE has been specified
in [RFC4555]. MOBIKE augments the basic IKEv2 protocol to include additional
"address change" options available in INFORMATIONAL exchanges. MOBIKE
specifies what to do when the changed addresses are known.
It does not address the discovery problem of how to determine these addresses. 

18.8.2 Authentication Header (AH) 

Defined in [RFC4302], the IP Authentication Header (AH), one of the three major
components of IPsec, is an optional portion of the IPsec protocol suite that pro¬ 
vides a method for achieving origin authentication and integrity (but not confi¬ 
dentiality) of IP datagrams. By providing only integrity and not confidentiality 
(and not working with NAT; see the remainder of this section), AH is the (far) 
less popular of the two primary IPsec data-securing protocols. In transport mode, 
AH uses a header placed between the layer 3 (IPv4, IPv6 base, or IPv6 extension) 
header and the following protocol header (e.g., UDP, TCP, ICMP). With IPv6, AH 
may appear immediately before a Destination Options extension header, if pres¬ 
ent. In tunnel mode, the "inner" IP header carries the original IP datagram, con¬ 
taining the ultimate IP source and destination information, and a newly created 
"outer" IP header contains information describing the IPsec peers. In this mode, 
AH protects the entire inner IP datagram. Generally speaking, transport mode 
is used between end hosts that are directly connected, and tunnel mode is used 
between SGs or between a single host and an SG (e.g., for supporting a VPN). The 
IPv4 and IPv6 encapsulations for transport-mode AH, using TCP as an example,
are shown in Figure 18-16.

Figure 18-16 The IPsec Authentication Header is used to provide authentication and integrity protection for IPv4 and IPv6 datagrams. In
transport mode (depicted here with TCP), a conventional IP datagram is modified to include the AH.

In the figure, the IPv4 encapsulation uses a special IPv4 protocol number (51).
For IPv6, the AH is placed between the destination and other options. In either 
case, the resulting datagram has a mutable portion of its header and an immuta¬ 
ble portion of its header. The mutable portion is changed as the datagram moves 
through the network. Modifications include changing the IPv4 TTL or IPv6 Hop 
Limit field, IPv6 Flow Label field, DS Field, and ECN bits. The immutable portion, 
containing the source and destination IP addresses, is not changed by the net¬ 
work and is integrity-protected using fields in the AH. This prevents transport 
mode AH datagrams from being rewritten by NATs, a potential problem for many 
deployments. Transport mode cannot be used with fragments (IPv4 or IPv6). 

An alternative to transport mode is AH tunnel mode, shown in Figure 18-17. 
In this mode, the original datagram is untouched and instead is inserted inside an 
integrity-protecting new IP datagram. 

In tunnel mode, the entire original IP datagram is encapsulated and protected 
with the AH. The "inner" header is unmodified, and the "outer" header is created 
using the source and destination IP addresses associated with an SG or host. In 
such cases, AH protects all of the original datagram, plus some portions of the 
new header (which prevents it being modified by a NAT). 

Both modes of AH use the same AH shown in Figure 18-18. It identifies the 
datagram length and associated SA and includes integrity check information.
The Payload Length specifies the length of the AH in 32-bit-word units minus 2. 
The Security Parameters Index (SPI) field contains a 32-bit identifier of an SA at 
the receiver that contains SA-derived information relating to the association. For 
multicast SAs, the SPI value is handled in a special way (see Section 18.8.4). The 
Sequence Number is a 32-bit field that increments by 1 for each packet sent on the 
SA. This field is used for replay protection if enabled by the receiver (but it is 
always included by the sender, even if not checked by the receiver). An extended 
sequence number (ESN) operating mode is also defined and recommended and is 
negotiated during the IKE_SA_INIT exchange. If enabled, the sequence number 
is calculated using 64 bits, but only the lower-order 32 bits are included in the 
Sequence Number field. The length of the Integrity Check Value (ICV) field is vari¬ 
able and depends on the cryptographic suite used. This field is always an integral 
multiple of 32 bits in length. 
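
The field arithmetic above can be made concrete with a short Python sketch. The function name build_ah and its parameters are hypothetical. The fixed 12-byte layout (Next Header, Payload Length, a 16-bit reserved field, SPI, and Sequence Number) is taken from [RFC4302] rather than from the text above, and the extra alignment padding that IPv6 may require is not handled.

```python
import struct


def build_ah(next_header: int, spi: int, seq64: int, icv: bytes) -> bytes:
    """Pack a simplified AH; seq64 is the full (possibly extended) sequence number."""
    if len(icv) % 4 != 0:
        raise ValueError("ICV must be an integral multiple of 32 bits")
    fixed_len = 12  # Next Header, Payload Length, Reserved, SPI, Sequence Number
    # Payload Length is the AH length in 32-bit words, minus 2.
    payload_len = (fixed_len + len(icv)) // 4 - 2
    # With ESN, only the low-order 32 bits are carried in the header.
    seq_low = seq64 & 0xFFFFFFFF
    return struct.pack("!BBHII", next_header, payload_len, 0, spi, seq_low) + icv


# A 96-bit (12-byte) ICV placeholder, protecting TCP (protocol 6).
ah = build_ah(next_header=6, spi=0x1234, seq64=(1 << 32) + 7, icv=bytes(12))
print(len(ah), ah[1], struct.unpack("!I", ah[8:12])[0])  # prints: 24 4 7
```

With a 96-bit ICV the AH occupies 24 bytes, or six 32-bit words, so the Payload Length field holds 4. The extended sequence number 2^32 + 7 appears on the wire as 7.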

The algorithm used for integrity protection is specified in the correspond¬ 
ing SA as a type 3 transform and can be established manually or by using some 
automatic method such as IKE. The optional, recommended, and mandatory algorithms
for AH (and ESP, later) are provided in [RFC4835] and include HMAC-MD5-96
(optional), AES-XCBC-MAC-96 (recommended), and HMAC-SHA1-96
(mandatory). The integrity check is computed over the following portions of the
datagram: header fields before the AH that are either immutable in transit or
predictable in value when arriving at the destination AH SA endpoint, the AH,
everything after the AH, high-order bits of the ESN (if employed, even though they are
not sent), plus any padding.

Figure 18-17 The IPsec tunnel mode AH encapsulations provide authentication and integrity protection for IPv4 and IPv6 datagrams. In
tunnel mode (depicted here carrying TCP), a conventional IP datagram is encapsulated inside a new "outside" IP datagram
that carries the original datagram.

Figure 18-18 The IPsec AH is used to provide authentication and integrity protection for IPv4 and IPv6 datagrams
in either transport or tunnel mode. The SPI value indicates which SA the AH belongs to.
The Sequence Number field is used for countering replay attacks. The ICV provides a form of MAC
over the immutable portions of the payload.
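
As a rough illustration of the AH integrity computation, the following Python sketch forms an HMAC-SHA1-96 value, which is an HMAC-SHA1 output truncated to its first 96 bits. The function names and argument layout are hypothetical. The caller is assumed to have already zeroed the mutable IP header fields and the ICV field itself, and the position at which the ESN high-order bits are appended is an assumption of this sketch; [RFC4302] gives the exact procedure.

```python
import hashlib
import hmac
import struct
from typing import Optional


def hmac_sha1_96(key: bytes, data: bytes) -> bytes:
    """HMAC-SHA1 truncated to its first 96 bits (12 bytes)."""
    return hmac.new(key, data, hashlib.sha1).digest()[:12]


def ah_icv(key: bytes, immutable_ip_fields: bytes, ah_with_zero_icv: bytes,
           upper_layer: bytes, esn_high: Optional[int] = None) -> bytes:
    """Sketch of the ICV input: protected header fields, the AH, and what follows it."""
    data = immutable_ip_fields + ah_with_zero_icv + upper_layer
    if esn_high is not None:
        # High-order ESN bits are covered by the ICV even though they are not sent.
        data += struct.pack("!I", esn_high)
    return hmac_sha1_96(key, data)


key = b"k" * 20  # placeholder key for illustration only
icv = ah_icv(key, b"ip", bytes(24), b"tcp")
print(len(icv))  # prints: 12
```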


Some controversy has arisen over the disposition of mutable fields such as the 
ECN bits used to signal incipient congestion (see Chapters 5 and 16) when tunnel 
modes are used. In [RFC4301], such fields are simply copied to the correspond¬ 
ing fields present in the newly created "outer" IP header. In [RFC6040], however, 
normal mode and compatibility mode for tunnel encapsulation are defined. In normal 
mode, the CE and ECT bit fields are copied to the new header on encapsulation. In 
compatibility mode, the bits are cleared, producing an "outer" packet indicating 
a non-ECN-capable transport. During decapsulation, if the outer or inner header 
contains a CE indication, the indication is copied to the packet produced after 
decapsulation unless the original packet did not indicate ECT (in which case the 
packet is dropped). In addition, if ECT is indicated by either the outer or inner 
headers, ECT is set to true in the decapsulated packet. 
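
The encapsulation rules, and the CE portion of the decapsulation rules, can be sketched as follows. The function names are hypothetical, and the two-bit codepoint values are those of the standard ECN field. The sketch deliberately covers only whether a CE mark survives decapsulation; the choice of ECT codepoint in the remaining cases is left to the full rules of [RFC6040].

```python
# Two-bit ECN field codepoints.
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def encapsulate_ecn(inner_ecn: int, mode: str = "normal") -> int:
    """Return the ECN field to place in the new outer header."""
    if mode == "normal":
        return inner_ecn      # copy the inner field to the outer header
    if mode == "compatibility":
        return NOT_ECT        # clear: outer packet indicates a non-ECN-capable transport
    raise ValueError("unknown mode: " + mode)


def ce_on_decapsulation(outer_ecn: int, inner_ecn: int) -> str:
    """Return 'drop', 'CE', or 'no-CE' for the decapsulated packet."""
    if outer_ecn == CE and inner_ecn == NOT_ECT:
        return "drop"         # congestion was signaled to a transport that cannot see it
    if CE in (outer_ecn, inner_ecn):
        return "CE"
    return "no-CE"


print(encapsulate_ecn(ECT_0), encapsulate_ecn(CE, "compatibility"))
print(ce_on_decapsulation(CE, NOT_ECT), ce_on_decapsulation(CE, ECT_0))
```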

18.8.3 Encapsulating Security Payload (ESP) 

The ESP protocol of IPsec, defined in [RFC4303] (where it is called ESP (v3) even 
though ESP provides no formal version numbers), provides a selectable combina¬ 
tion of confidentiality, integrity, origin authentication, and anti-replay protection 
for IP datagrams. It can employ a NULL encryption method [RFC2410], which is 
mandatory to support, if only integrity is to be used. Conversely, encryption can 
be used for confidentiality without integrity protection, although this combina¬ 
tion is effective only against passive attacks and is highly discouraged. In the con¬ 
text of ESP, integrity includes data origin authentication. Given its flexibility and 
feature set, ESP is (far) more popular than AH. 

18.8.3.1 Transport and Tunnel Modes 

Like AH, ESP has transport and tunnel modes. In tunnel mode, an "outer" IP 
packet includes an "inner" IP packet that may be entirely encrypted. This provides
for a limited form of traffic flow confidentiality (TFC) because the "inner"
datagram's size and contents can be hidden using encryption. ESP may be used
in combination with AH, if desired, and supports both IPv4 and IPv6. Using ESP
in "integrity-only" mode may be preferable to AH in some cases for performance
reasons (ESP may be more amenable to pipelining) and is a required configuration
option for IPsec implementations. The encapsulations for ESP transport mode are
shown in Figure 18-19.

The transport mode structure is similar to AH transport mode, except ESP 
trailer structures are used in support of ESP's encryption and integrity protection 
methods (see Section 18.8.3). As with AH, ESP transport mode cannot be used 
with fragments. The tunnel mode encapsulations for ESP, similar to those for AH, 
are shown in Figure 18-20.

ESP does not use a strict header in the same way AH does. Instead, there 
is an overall ESP structure that includes a header and trailer portion. There is 
an optional (second) trailer structure if ESP is used with an integrity protection 
mechanism that requires space for additional check bits (labeled ESP ICV). The 
ESP structure is shown in Figure 18-21.

ESP-encapsulated IP datagrams use the value 50 in the Protocol (IPv4) or Next 
Header (IPv6) header fields. The ESP payload structure, shown in Figure 18-21,
includes the SPI and sequence numbers, used in the same way as with AH. The 
primary difference is in the payload area. This area may be confidentiality-pro¬ 
tected (encrypted) and can include a variable-length pad portion required by 
some encryption algorithms. 

The payload is required to end on a 32-bit boundary (64 for IPv6) and have the 
last two 8-bit fields identify the Pad Length and Next Header (Protocol) field values. 
The Pad, Pad Length, and Next Header fields constitute the ESP trailer shown in
Figures 18-19 and 18-20. Certain cryptographic algorithms may employ an IV. If
present, the IV appears at the beginning of the payload area (not shown). Additional
padding for TFC purposes (called TFC padding) is permitted to appear within the
payload area in front of the ESP trailer (see Figure 2 of [RFC4303] for details). It is
used to disguise the length of the datagram to help resist traffic analysis attacks,
although this feature does not appear to be widely used. The Next Header field
contains values chosen from the same space used in the IPv4 Protocol field or IPv6
Next Header field (e.g., 4 for IPv4, 41 for IPv6). It may contain the value 59,
indicating "no next header," when carrying a dummy packet that is to be discarded.
Dummy packets are another method sometimes used for resisting traffic analysis 
attacks. 
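
The trailer construction can be sketched as a small Python function. The name esp_pad and its parameters are hypothetical. The sketch assumes that the encryption algorithm's block size is a multiple of 4 bytes and that pad bytes take the values 1, 2, 3, and so on, which is the default pad content given in [RFC4303]. Encryption, the IV, and TFC padding are omitted.

```python
import struct


def esp_pad(payload: bytes, next_header: int, block_size: int = 4) -> bytes:
    """Append Pad, Pad Length, and Next Header so the result ends on a boundary."""
    align = max(block_size, 4)  # at least a 32-bit boundary
    # The two trailer bytes (Pad Length, Next Header) count toward the alignment.
    pad_len = (-(len(payload) + 2)) % align
    padding = bytes(range(1, pad_len + 1))
    return payload + padding + struct.pack("!BB", pad_len, next_header)


out = esp_pad(b"hello", next_header=6)
print(len(out), out[-2], out[-1])  # prints: 8 1 6
```

A 5-byte payload needs one pad byte so that payload, pad, and the two trailer bytes total 8 bytes, a multiple of 4.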

The ESP ICV is a variable-length trailer used if integrity support is enabled 
and required by the integrity-checking algorithm. It is computed over the ESP 
header, payload, and ESP trailer. Implicit values (e.g., high-order ESN bits) are also 
included. The length of the ICV is known as a consequence of selecting the par¬ 
ticular integrity-checking method. It is therefore established at the time the cor¬ 
responding SA is set up and not changed as long as the SA exists. 

Anti-replay is supported provided integrity protection is enabled. This is 
accomplished using a sequence number derived from a running counter. The
counter is initialized to 0 when an SA is first set up and incremented before being
copied into each datagram sent on the SA. When anti-replay is enabled (the nor¬ 
mal default), the sender checks to see that the counter has not wrapped and cre¬ 
ates a new SA if wrapping is about to occur. The receiver implementing anti-replay 
keeps a valid window of sequence numbers (similar in some ways to the TCP 
receiver's window). Datagrams containing out-of-window sequence numbers are 
dropped. 
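
A receiver's anti-replay window is commonly kept as a bitmap anchored at the highest sequence number accepted so far. The sketch below follows that common approach; the class name, method name, and default window size are assumptions, and ESN handling is omitted. In this sketch a sequence number beyond the right edge advances the window, while numbers that fall off the left edge or repeat are rejected. The check is assumed to run only for packets whose ICV has already verified.

```python
class ReplayWindow:
    """Sliding anti-replay window for 32-bit sequence numbers (no ESN)."""

    def __init__(self, size: int = 64):
        self.size = size
        self.highest = 0  # highest sequence number accepted so far
        self.bitmap = 0   # bit i set means (highest - i) has been seen

    def check_and_update(self, seq: int) -> bool:
        """Return True if seq is accepted, False if it must be dropped."""
        if seq == 0:
            return False  # the counter is incremented before use, so 0 is never sent
        if seq > self.highest:
            shift = seq - self.highest
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.highest = seq
            return True
        offset = self.highest - seq
        if offset >= self.size:
            return False  # older than the left edge of the window
        if self.bitmap & (1 << offset):
            return False  # duplicate: already received
        self.bitmap |= 1 << offset
        return True


w = ReplayWindow(size=4)
results = [w.check_and_update(s) for s in (1, 3, 2, 3, 10, 6, 7)]
print(results)  # prints: [True, True, True, False, True, False, True]
```

In the example, the second arrival of sequence number 3 is a replay, and 6 is rejected because it falls outside the 4-entry window once 10 has been accepted.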

For systems that implement auditing, ESP processing can result in one or more 
auditable events. These events include the following: no valid SA exists for a session, 
the datagram given to ESP for processing is a fragment, the anti-replay counter 
is about to wrap, a received packet was out of the valid anti-replay window, and the
integrity check failed. Auditable events are recorded in a logging system. These 
events include metadata such as the SPI value, current date and time, source and 
destination IP addresses, sequence number, and IPv6 flow ID (if present). 

18.8.3.2 ESP-NULL, Wrapped ESP (WESP), and Traffic Visibility

As mentioned previously, ESP ordinarily provides privacy using encryption, but
it can also operate in an integrity-only mode using the NULL encryption algo¬ 
rithm. Integrity-only mode (also called ESP-NULL) may be desirable in some cir¬ 
cumstances, especially in enterprise environments where sophisticated packet 
inspection takes place within the network and confidentiality may be addressed 
in other ways. For example, some network infrastructure devices inspect pack¬ 
ets for unwanted content (e.g., malware signatures) and are capable of providing 
alerts or shutting down network access when policy is violated. Such devices are 
essentially disabled if ESP is used with encryption in an end-to-end fashion (i.e., 
the way it was designed). Said another way, unless they have traffic visibility, they 
cannot do their jobs. 

When a packet inspection device is faced with ESP traffic, it needs to make 
a decision about whether the traffic is encrypted (i.e., whether NULL encryption 
is being used or not). Given that the negotiation of an IPsec cryptographic suite 
is handled outside ESP (e.g., manually or using a protocol such as IKE), there are 
two current methods for doing so. The first is simply to use a set of nonstandard 
heuristics to make a guess [RFC5879]. Use of these has the benefit of not requiring 
any modification to ESP for supporting traffic visibility. The other method is to 
add a special description to ESP to indicate whether encryption is used. Wrapped 
ESP (WESP) [RFC5840], a standards-track RFC, defines a header that is placed 
ahead of the ESP packet structure. WESP uses a different protocol number (141) 
from ESP and can be negotiated with IKE using the USE_WESP_MODE (value 
16415) Notify payload. The variable-length WESP header includes fields to indi¬ 
cate the location of payload information, along with a Flags field (maintained by 
the IANA [IWESP]) containing a bit indicating whether ESP-NULL is being used.
Although WESP makes the job of determining whether ESP-NULL is being used 
or not easier for network infrastructure, its utility also depends on end hosts using 
the WESP header appropriately. Given that WESP is relatively new, this is not yet
the case today. On the other hand, the WESP format is extensible, so once
implemented it could be adapted for other purposes in the future.

18.8.4 Multicast 

IPsec optionally supports multicast operations [RFC5374], although this capabil¬ 
ity is not often used. The most basic form involves using manual key configura¬ 
tion, but there are also multicast group key establishment methods called group 
key management (GKM) protocols managed by group controller/key servers (GCKSs).
These are used to produce group security associations (GSAs), which include one 
or more IPsec SAs plus one or more GKM SAs used to provide parameters for 
establishing the IPsec SAs [RFC3740]. Given that members may dynamically join
or leave a group, GKM protocols must deal with rekeying more frequently and 
carefully than regular two-party key establishment protocols, and such protocols 
have been a favorite topic for security researchers [AKNT04]. We shall not explore 
the details of how GKMs operate (such an explanation would be lengthy), but the 
interested reader may consult documentation for GDOI [RFC3547] or GSAKMP
[RFC4535].

At present, multicast IPsec operation requires all members of a group to be 
homogeneous in their algorithmic and protocol processing capabilities. Both
any-source and single-source multicast (ASM and SSM) operations are supported (see
Chapter 9), and the same procedures are used for IPv4 local broadcast addresses
and for IPv6 anycast addresses. Host IPsec implementations may use any combi¬ 
nation of tunnel and transport mode, but SGs must use tunnel mode where the 
tunnel destination addresses are multicast addresses. 

Multicast IP datagrams present a challenge for IPsec when a tunnel mode is 
used because the outer IP datagram's addressing needs to be a multicast desti¬ 
nation address in order to be routed efficiently using a multicast-capable infra¬ 
structure. This requires a special procedure, known as tunnel mode with address 
preservation, to be applied when placing datagrams into AH or ESP tunnels. 
In short, this procedure involves choosing the outer IP source and destination 
addresses to match the inner addresses (assuming the same version of IP is being 
used). The purposes of doing so are (1) to ensure that multicast routing is invoked 
on the datagram and (2) to ensure that the reverse path forwarding (RPF) check 
used in computing multicast routes works properly (see Chapter 9).

Introduction of multicast requires modification of some of the low-level IPsec 
machinery we saw in Figure 18-10. For example, the SPD and SAD are modified to 
include an "address preservation" flag used in implementing the address-preserv¬ 
ing tunnel modes. In addition, a directionality flag in the SPD is used to determine 
under what circumstances SAs should be automatically created. This ensures that 
no SAs are created that would use prohibited multicast source addresses as a con¬ 
sequence of simply reversing source and destination IP addresses (as with unicast 
SAs). The SPD may need to include state as to when a GKM protocol needs to be 
invoked (e.g., for obtaining a needed group key), and a group PAD (GPAD) holds
the information specific to each GCKS, including for which traffic selectors each
GCKS is able to produce SAs and authentication information that may be required
to engage in a particular GKM protocol with a particular GCKS. GPAD material is
not consulted by non-GKM protocols such as IKE, but the PAD and GPAD struc¬ 
tures might be implemented together. 

18.8.5 L2TP/IPsec 

The Layer 2 Tunneling Protocol (L2TP) (see Chapter 3) supports tunneling of layer
2 traffic such as PPP through IP and non-IP networks. It relies on authentication 
methods that provide some authentication during connection initiation, but no 
subsequent per-packet authentication, integrity protection, or confidentiality. To 
address this concern, L2TP can be combined with IPsec [RFC3193]. The combination,
called L2TP/IPsec, provides a recommended method to establish remote
layer 2 VPN access to enterprise (or home) networks. L2TP can be secured with 
IPsec using either a direct L2TP-over-IP encapsulation (protocol number 115) or a 
UDP/IP encapsulation that eases NAT traversal. 

L2TP/IPsec uses IKE by default, although other keying methods are possible. 
It uses an ESP SA in either transport mode (support required) or tunnel mode 
(support optional). The SA is used to secure the L2TP traffic, which is then respon¬ 
sible for establishing the layer 2 tunnel. Because it is really a combination of two 
protocols, both of which involve authentication, L2TP/IPsec often requires two 
distinct authentication procedures: one for the machine (using IPsec with pre¬ 
shared keys or certificates) and another for the user (e.g., using a name and pass¬ 
word or access token). 

L2TP/IPsec is supported on most modern platforms. On Windows, creating a 
new connection with the "Connect to a workplace" option can be used to enable
L2TP and L2TP/IPsec. Some smartphones (e.g., Android, iPhone) support L2TP in
their networking configuration setup screens. Mac OS X includes an L2TP/IPsec 
network adapter type that can be added using the system preferences. On Linux, 
it may be necessary to configure both IPsec and L2TP for them to work together. If 
L2TP is not required on such systems, direct IPsec may be preferable. 

18.8.6 IPsec NAT Traversal 

Using NATs with IPsec can present something of a challenge, primarily because IP 
addresses have traditionally been used in identifying communication endpoints 
and are assumed to not change. These assumptions were not entirely avoided (or 
obviated) when IPsec was first designed, so NAT has posed a problem. This is 
one factor contributing to the relatively slow deployment of IPsec. However, today 
IPsec supports both changing addresses (with MOBIKE) and NAT traversal. 

To have a complete NAT traversal solution, we must take into account IKE, AH, 
and ESP in both transport and tunnel modes. As we shall see, when NATs must be 
accommodated, not all combinations of IPsec may be usable with all applications.
Guidance for what a solution requires is given in [RFC3715]. We shall first discuss
a variety of issues that highlight fundamental incompatibilities between NATs 
and IPsec and then describe the methods that have been adopted to handle the 
problems. 

One fundamental problem arises with AH and how NATs update the 
addresses in datagrams. Because the AH includes a MAC computation covering 
the datagram's IP addresses, a NAT is unable to rewrite addresses without invali¬ 
dating the AH. Note that ESP does not share this issue, as its integrity protection 
mechanism does not include the IP addresses in its MAC. 
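
The difference between the two integrity computations can be demonstrated with a toy example. The sketch below is not the real AH or ESP computation: the key, the addresses, and the function names are placeholders, and the MAC input is reduced to the addresses and payload (for the AH-style check) or the payload alone (for the ESP-style check).

```python
import hashlib
import hmac
import ipaddress

KEY = b"example-integrity-key"  # placeholder key for illustration only


def mac(data: bytes) -> bytes:
    return hmac.new(KEY, data, hashlib.sha1).digest()[:12]


def ah_style_icv(src: str, dst: str, payload: bytes) -> bytes:
    """MAC input includes the IP addresses, as with AH."""
    return mac(ipaddress.ip_address(src).packed + ipaddress.ip_address(dst).packed + payload)


def esp_style_icv(payload: bytes) -> bytes:
    """MAC input excludes the IP addresses, as with ESP."""
    return mac(payload)


payload = b"upper-layer data"
sent_ah = ah_style_icv("10.0.1.48", "192.0.2.1", payload)
sent_esp = esp_style_icv(payload)

# A NAT rewrites the source address; the receiver recomputes over what it sees.
ah_ok = hmac.compare_digest(sent_ah, ah_style_icv("203.0.113.5", "192.0.2.1", payload))
esp_ok = hmac.compare_digest(sent_esp, esp_style_icv(payload))
print(ah_ok, esp_ok)  # prints: False True
```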

Another problem arises with the UDP and TCP transport protocols because of 
the pseudo-header checksum, which incorporates IP addresses in its computation. 
When the transport-layer checksum is integrity-protected or encrypted, the NAT 
is unable to update the checksum without forming an invalid packet. A similar 
situation can arise for NAPT when changing port numbers, or for other protocols 
that perform layering violations. 
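
To see why the transport checksum depends on the IP addresses, consider the following sketch of the UDP checksum over an IPv4 pseudo-header. The helper names and the example addresses are assumptions. A NAT that rewrites the source address must also update the checksum, which it cannot do when the transport header is encrypted or integrity-protected.

```python
import ipaddress
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum of the one's-complement sum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def udp_checksum(src: str, dst: str, udp_segment: bytes) -> int:
    """udp_segment is the UDP header (checksum field zero) followed by the data."""
    pseudo = (ipaddress.IPv4Address(src).packed + ipaddress.IPv4Address(dst).packed
              + struct.pack("!BBH", 0, 17, len(udp_segment)))
    # A computed value of zero is transmitted as all ones in UDP.
    return internet_checksum(pseudo + udp_segment) or 0xFFFF


data = b"ike"
segment = struct.pack("!HHHH", 500, 500, 8 + len(data), 0) + data
before_nat = udp_checksum("10.0.1.48", "192.0.2.1", segment)
after_nat = udp_checksum("203.0.113.5", "192.0.2.1", segment)
print(before_nat != after_nat)  # prints: True
```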

A third major problem relates to the ID payloads in IKE. There are several ways 
to identify an IKE peer, one of which is to use IP addresses. As these addresses are 
embedded within an encrypted IKE payload, they are not able to be modified by 
a conventional NAT, leading to failure. Alternative methods for identifying peers 
may be available, however (e.g., FQDN or the distinguished name from an X.509
certificate). 

A fourth significant concern is how a NAT or NAPT demultiplexes incoming 
traffic to the proper host. In protocols such as TCP and UDP, the port number is 
used for this purpose. However, IPsec AH and ESP act like transport protocols 
that carry no port numbers but instead use an SPI value. While some NATs can 
make use of the SPI value for demultiplexing, these values are chosen by an IPsec 
responder as a local matter and multiple independent hosts may choose the same 
value. Because a NAT cannot easily modify these values, it is possible for a NAT to 
improperly demultiplex incoming (returning) traffic, with a potential for errone¬ 
ous delivery. 

There are other potential problems for NATs that become more acute when 
IPsec is employed. For example, application protocols that carry IP addresses (e.g.,
SIP), if integrity-protected or encrypted, cannot be modified by a conventional 
NAT. In addition, configuration and analysis are more difficult because traffic that 
could otherwise be decoded for analysis is now obscured because of encryption. 
Fortunately, some network analysis tools (e.g., Wireshark) can process encrypted
traffic if provided the necessary key material. 

The primary approach to dealing with most of the NAT traversal concerns is 
to encapsulate IPsec ESP and IKE traffic using UDP/IP, which can be modified by 
conventional NATs when necessary. (There is no supported solution for NAT tra¬ 
versal of AH.) An IKE initiator can use UDP port 500 or 4500 for sending IKE and 
then transition to using port 4500 for UDP-encapsulated ESP and IKE, whether or 
not a NAT is present. UDP ESP encapsulation is prohibited on port 500 according
to [RFC5996]. The purpose of using port 4500 is to avoid some NATs that improperly
process IPsec traffic on port 500.

NAT traversal for IKE is an optional feature of an IKE implementation. If 
supported, the following two Notification payloads can be included with the 
IKE_SA_INIT exchange: NAT_DETECTION_DESTINATION_IP and NAT_ 
DETECTION_SOURCE_IP. If present, these appear after the Ni and Nr payloads 
and before CERTREQ payloads. The data associated with these payloads includes 
a SHA-1 hash of the SPIs for the SA, the source or destination IP address, and the 
source or destination port number. Such information is preserved as the IKE mes¬ 
sages are passed through NATs. When receiving IKE messages that suggest a NAT 
is present, IKE processing continues using a UDP/IP encapsulation on port 4500, 
which tends to pass through NATs unimpeded. 
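
The detection logic can be sketched as follows. The sender places a hash of the SPIs, address, and port as it sees them in the Notification payload; the receiver recomputes the hash from the addressing information in the datagram that actually arrived and compares. The function name and the example values are hypothetical. The SPIs are assumed to be hashed in the order in which they appear in the IKE header, with the port in network byte order.

```python
import hashlib
import ipaddress
import struct


def nat_detection_hash(spi_i: bytes, spi_r: bytes, ip: str, port: int) -> bytes:
    """SHA-1 over the SPIs, an IP address, and a port number."""
    return hashlib.sha1(
        spi_i + spi_r + ipaddress.ip_address(ip).packed + struct.pack("!H", port)
    ).digest()


spi_i = bytes.fromhex("0123456789abcdef")
spi_r = bytes(8)  # the responder's SPI is not yet established in the first message

# Value the initiator places in NAT_DETECTION_SOURCE_IP (its own view).
sent = nat_detection_hash(spi_i, spi_r, "10.0.1.48", 500)

# Value the responder computes from the source it actually observes.
observed = nat_detection_hash(spi_i, spi_r, "203.0.113.5", 1024)

nat_present = sent != observed
print(nat_present)  # prints: True
```

If no NAT had rewritten the source address or port, the two hashes would match and the responder would conclude that the initiator is not behind a NAT.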

After having traversed one or more NATs, arriving IKE traffic being used to 
set up a transport-mode SA may contain traffic selectors (TS payloads) with IP 
addresses or ranges that are not meaningful (i.e., they are private IP addresses 
"behind" a NAT) and that do not match the IP addresses contained in the addressing 
fields of the IKE datagram arriving at the responder. This is handled by first 
storing the addresses in the TSi and TSr IKE payloads for later use and then replacing 
them with the source and destination IP addresses present in the received datagram. 
In essence, this is a form of "delayed NAT" on TS payloads performed by 
the recipient. The resulting datagram and TS payloads are used to query the SPD 
in order to determine the security policy for the requested SA. If transport mode is 
used, the responder completes the exchange and the initiator performs similar TS 
payload substitution processing (see Section 2.23.1 of [RFC5996] for more details). 

18.8.7 Example 

There are several open-source and proprietary IPsec implementations. Windows 7 
supports IKEv2 and MOBIKE in Microsoft's Agile VPN subsystem. Linux includes 
kernel-level IPsec support in kernel version 2.6 and later, and the OpenSwan and 
StrongSwan packages can be used to implement complete VPN solutions. In the 
following example, we use a Linux server running StrongSwan (IPv4 address 
10.0.0.3) with a Windows 7 client (IPv4 address 10.0.1.48) using RSA-based machine 
certificates we have created for authentication to demonstrate IKE. The IKE initial 
exchanges are shown in Figure 18-22. 

Looking at this figure, we can see that Wireshark decodes the IKE exchange 
using ISAKMP as the protocol name. This is the now-deprecated Internet Security 
Association and Key Management Protocol and is the historical name of what ultimately 
became IKE. The IKE header contains the initiator's SPI (labeled "Initiator 
cookie") and the responder's SPI, which has not yet been established. The version 
number is 2, indicating that this packet contains IKEv2, and the exchange type is 
IKE_SA_INIT. 
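The header fields just described can be decoded with a few lines of Python. This is a minimal sketch, assuming the fixed 28-byte IKEv2 header layout of [RFC5996]; the names `parse_ike_header` and `IkeHeader` and the sample SPI value are illustrative and do not come from the trace.

```python
import struct
from typing import NamedTuple

# Fixed 28-byte IKEv2 header in network byte order:
# SPIi (8), SPIr (8), next payload, version, exchange type, flags,
# message ID (4), total length (4).
IKE_HEADER = struct.Struct("!8s8sBBBBII")

FLAG_INITIATOR = 0x08  # set in messages sent by the original initiator
FLAG_RESPONSE = 0x20   # set in responses


class IkeHeader(NamedTuple):
    initiator_spi: bytes
    responder_spi: bytes
    next_payload: int
    major_version: int
    minor_version: int
    exchange_type: int
    flags: int
    message_id: int
    length: int


def parse_ike_header(data: bytes) -> IkeHeader:
    """Decode the fixed IKE header at the start of an IKE message."""
    (spi_i, spi_r, next_payload, version,
     exchange_type, flags, message_id, length) = IKE_HEADER.unpack_from(data)
    # The version octet holds the major version in the high 4 bits
    # and the minor version in the low 4 bits.
    return IkeHeader(spi_i, spi_r, next_payload, version >> 4, version & 0x0F,
                     exchange_type, flags, message_id, length)


# Hypothetical first IKE_SA_INIT request: the responder SPI is still zero,
# the next payload is SA (33), and the exchange type is IKE_SA_INIT (34).
sample = IKE_HEADER.pack(bytes.fromhex("0123456789abcdef"), bytes(8),
                         33, 0x20, 34, FLAG_INITIATOR, 0, IKE_HEADER.size)
header = parse_ike_header(sample)
print(header.major_version, header.exchange_type, header.responder_spi.hex())
```

An all-zero responder SPI, as in this sample, is what a capture shows for the very first message of an exchange.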

Figure 18-22 A trace of the initial IKE exchanges, highlighting the first packet. The IKE_SA_INIT exchange is 
carried on UDP port 500 and includes the initiator's SPI, proposals for cryptographic suite algorithms, 
DH key exchange material, a nonce, and Notify payloads used to indicate addresses for 
NAT traversal. Each proposal in the SA payload requests the establishment of an IKE_SA using 
a set of transforms for encryption, integrity protection, a PRF used for generation of random 

Looking closer, we can see this is an IKE_SA_INIT message containing five 
payloads: one SA, one KE, one Nonce, and two of type Notify. The SA payload 
includes six proposals, each of which contains a list of transforms. The proposals 
represent sets of algorithms the initiator is willing to use. Proposal 6 (the last one) 
has been expanded to show more detail. It suggests AES in CBC mode with a 256-bit 
key length for encryption, HMAC with SHA-256 for integrity protection, a PRF 
based on SHA-384, and the alternate 1024-bit MODP group for DH key agreement. 
The other proposals (not detailed) include suggestions for 3DES encryption, AES 
encryption with different key lengths, SHA-1 for integrity protection, and other 
SHA variants for the PRF. Following the SA payload, the Key Exchange payload 
contains the public information required to perform a DH exchange using the 
"alternate 1024-bit MODP group." In the other payloads, we find a nonce containing 
a 48-byte random bit string and two Notify payloads used for NAT traversal. 
The first Notify payload is of type NAT_DETECTION_SOURCE_IP, and the second 
contains NAT_DETECTION_DESTINATION_IP. The value in the first contains 
a 20-byte SHA-1 hash over these values: 8 bytes of the initiator's SPI, 8 bytes 
of the responder's SPI (0 here), 4 bytes of source IPv4 address, and 2 bytes of UDP 
source port number. The value in the second covers the same fields as the first, except 
that the destination IPv4 address and port are used in place of the source address 
and port. Figure 18-23 illustrates the response to the first IKE_SA_INIT message. 
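The hash carried in the NAT detection payloads can be reproduced directly from its description. The following is a minimal sketch, assuming the field order given above (SPIi, SPIr, IP address, port, with the port in network byte order); the function names, the SPI value, and the translated address 192.0.2.1 are hypothetical.

```python
import hashlib
import ipaddress
import struct


def nat_detection_hash(spi_i: bytes, spi_r: bytes, ip: str, port: int) -> bytes:
    """SHA-1 over SPIi | SPIr | IP address | port, as carried in the Notify data."""
    addr = ipaddress.ip_address(ip).packed  # 4 bytes for IPv4, 16 for IPv6
    return hashlib.sha1(spi_i + spi_r + addr + struct.pack("!H", port)).digest()


def nat_suspected(received_hash: bytes, spi_i: bytes, spi_r: bytes,
                  observed_ip: str, observed_port: int) -> bool:
    """Compare the sender's hash with one recomputed from the received datagram."""
    return received_hash != nat_detection_hash(spi_i, spi_r,
                                               observed_ip, observed_port)


spi_i = bytes.fromhex("0123456789abcdef")  # illustrative initiator SPI
spi_r = bytes(8)                           # responder SPI not yet assigned

# The sender computes the hash over its own (pre-NAT) address and port.
sent = nat_detection_hash(spi_i, spi_r, "10.0.1.48", 500)

# Receiver sees the same source address: no NAT is indicated.
print(nat_suspected(sent, spi_i, spi_r, "10.0.1.48", 500))
# Receiver sees a rewritten source address and port: a NAT is indicated.
print(nat_suspected(sent, spi_i, spi_r, "192.0.2.1", 1024))
```

The payload itself is untouched by the NAT, so only the recomputed value changes when the headers are rewritten.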

In this figure, the IKE_SA_INIT message contains the following payloads: SA, 
KE, Nonce, three of type Notify, and a Certificate Request. The SA payload contains 
only one proposal, comprising the following transforms: 3DES for encryption, 
HMAC_SHA1_96 for integrity, HMAC_SHA1 for the PRF, and group 2 for the 
DH exchange. The KE payload contains a 128-byte value from the 1024-bit MODP 
group. The Nonce payload contains a 32-byte random value for freshness. The next 
two Notify payloads contain NAT_DETECTION_SOURCE_IP and 
NAT_DETECTION_DESTINATION_IP, as described earlier. Following these are new payloads 
we have not yet encountered: CERTREQ and MULTIPLE_AUTH_SUPPORTED. 

The Certificate Request (CERTREQ) payload indicates the responder's preferred 
certificates. In this case, the responder indicates that any certificates later 
supplied by the initiator should be associated with a particular certificate authority. 
The encoding used to express the CA is one of several defined in Section 3.6 
of [RFC5996], but only the values 4, 12, and 13 are currently standardized. Here, 
the payload contains the value 4, meaning the Certificate Authority Data subfield 
contains a concatenation of SHA-1 hashes of the public keys (X.509 Subject Public 
Key Info element) of trusted CAs. Given that the length of this subfield is only 20 
bytes in this example, we can see that only a single CA is listed. It happens to be 
the SHA-1 hash of the DER encoding of the public key of the sample root certificate 
for the "Test CA" we created for this example. 
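The reasoning about the length of the Certificate Authority Data subfield can be sketched in code. This is an illustration only, assuming the subfield is a plain concatenation of 20-byte SHA-1 hashes as described above; `split_ca_hashes` and `is_trusted` are hypothetical helper names, and the `spki` bytes are a stand-in for a real DER-encoded Subject Public Key Info element.

```python
import hashlib
from typing import List

SHA1_LEN = 20  # each trusted CA contributes one 20-byte SHA-1 hash


def split_ca_hashes(ca_data: bytes) -> List[bytes]:
    """Split the Certificate Authority Data subfield into per-CA hashes."""
    if len(ca_data) % SHA1_LEN:
        raise ValueError("CA data is not a whole number of SHA-1 hashes")
    return [ca_data[i:i + SHA1_LEN] for i in range(0, len(ca_data), SHA1_LEN)]


def is_trusted(spki_der: bytes, ca_data: bytes) -> bool:
    """True if the SHA-1 hash of a CA's DER-encoded public key info is listed."""
    return hashlib.sha1(spki_der).digest() in split_ca_hashes(ca_data)


# The 20-byte value shown in the CERTREQ payload of Figure 18-23.
ca_data = bytes.fromhex("987f9dfbb804fc08ed632549f7d6d0e488810e5a")
print(len(split_ca_hashes(ca_data)))  # a single trusted CA is listed

# Placeholder bytes standing in for a real Subject Public Key Info element.
spki = b"example SubjectPublicKeyInfo bytes"
print(is_trusted(spki, hashlib.sha1(spki).digest()))
```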


Note 

The binary Distinguished Encoding Rules (DER) format is a subset of the ASN.1 
standard Basic Encoding Rules (BER). DER permits values to be encoded in only 
a single, unambiguous way. DER is one of the two most popular ways to encode 
X.509 certificates. The other is PEM, an ASCII format, which we showed earlier. Various 
utilities, including openssl, may be used to convert between the two formats. 
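The relationship between the two formats is simple enough to show directly: a PEM certificate is the base64 encoding of the DER bytes, wrapped between header and footer lines. The following sketch uses only the Python standard library; the function names are illustrative, and the input bytes are a placeholder rather than a real certificate.

```python
import base64

PEM_HEADER = "-----BEGIN CERTIFICATE-----"
PEM_FOOTER = "-----END CERTIFICATE-----"


def der_to_pem(der: bytes) -> str:
    """Wrap DER bytes as base64 text, 64 characters per line."""
    b64 = base64.b64encode(der).decode("ascii")
    lines = [b64[i:i + 64] for i in range(0, len(b64), 64)]
    return "\n".join([PEM_HEADER, *lines, PEM_FOOTER]) + "\n"


def pem_to_der(pem: str) -> bytes:
    """Strip the PEM armor and decode the base64 body back to DER bytes."""
    lines = [line.strip() for line in pem.strip().splitlines()]
    if not lines or lines[0] != PEM_HEADER or lines[-1] != PEM_FOOTER:
        raise ValueError("not a PEM-encoded certificate")
    return base64.b64decode("".join(lines[1:-1]))


# Placeholder bytes; a real input would be a DER-encoded X.509 certificate.
der = bytes(range(256)) * 3
pem = der_to_pem(der)
print(pem.splitlines()[0])
print(pem_to_der(pem) == der)
```

Neither function validates the ASN.1 structure; they only convert the outer encoding.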



No.  Time      Source     Destination  Protocol  Info
1    0.000000  10.0.1.48  10.0.0.3     ISAKMP    IKE_SA_INIT

Frame 2: 375 bytes on wire (3000 bits), 375 bytes captured (3000 bits)
Ethernet II, Src: 00:0b:db:bb:f3:64 (00:0b:db:bb:f3:64), Dst: 00:27:10:8e:a2:14
Internet Protocol Version 4, Src: 10.0.0.3 (10.0.0.3), Dst: 10.0.1.48 (10.0.1.48)
User Datagram Protocol, Src Port: 500 (500), Dst Port: 500 (500)
Internet Security Association and Key Management Protocol
    Initiator cookie: e9f321ebe19efa4e
    Responder cookie: 71cc31af621a3512
    Next payload: Security Association (33)
    Version: 2.0
    Exchange type: IKE_SA_INIT (34)
    Flags: 0x20
    Message ID: 0x00000000
    Length: 333
    Type Payload: Security Association (33)
        Next payload: Key Exchange (34)
        0... .... = Critical Bit: Not Critical
        Payload length: 44
        Type Payload: Proposal (2) # 1
            Next payload: NONE / No Next Payload (0)
            0... .... = Critical Bit: Not Critical
            Payload length: 40
            Proposal number: 1
            Protocol ID: IKE (1)
            SPI Size: 0
            Proposal transforms: 4
            Type Payload: Transform (3)
                Next payload: Transform (3)
                0... .... = Critical Bit: Not Critical
                Payload length: 8
                Transform Type: Encryption Algorithm (ENCR) (1)
                Transform ID (ENCR): ENCR_3DES (3)
            Type Payload: Transform (3)
                Next payload: Transform (3)
                0... .... = Critical Bit: Not Critical
                Payload length: 8
                Transform Type: Integrity Algorithm (INTEG) (3)
                Transform ID (INTEG): AUTH_HMAC_SHA1_96 (2)
            Type Payload: Transform (3)
                Next payload: Transform (3)
                0... .... = Critical Bit: Not Critical
                Payload length: 8
                Transform Type: Pseudo-random Function (PRF) (2)
                Transform ID (PRF): PRF_HMAC_SHA1 (2)
            Type Payload: Transform (3)
                Next payload: NONE / No Next Payload (0)
                0... .... = Critical Bit: Not Critical
                Payload length: 8
                Transform Type: Diffie-Hellman Group (D-H) (4)
                Transform ID (D-H): Alternate 1024-bit MODP group (2)
    Type Payload: Key Exchange (34)
    Type Payload: Nonce (40)
    Type Payload: Notify (41)
        Next payload: Notify (41)
        0... .... = Critical Bit: Not Critical
        Payload length: 28
        Protocol ID: RESERVED (0)
        SPI Size: 0
        Notify Message Type: NAT_DETECTION_SOURCE_IP (16388)
        Notification DATA: f6616abe3c97c01bb5857953089dc03c84cc166c
    Type Payload: Notify (41)
        Next payload: Certificate Request (38)
        0... .... = Critical Bit: Not Critical
        Payload length: 28
        Protocol ID: RESERVED (0)
        SPI Size: 0
        Notify Message Type: NAT_DETECTION_DESTINATION_IP (16389)
        Notification DATA: ef3c8cc4d072ebfeef76ee1002aa2315567fha1f
    Type Payload: Certificate Request (38)
        Next payload: Notify (41)
        0... .... = Critical Bit: Not Critical
        Payload length: 25
        Certificate Type: X.509 Certificate - Signature (4)
        Certificate Authority Data: 987f9dfbb804fc08ed632549f7d6d0e488810e5a
    Type Payload: Notify (41)
        Next payload: NONE / No Next Payload (0)
        0... .... = Critical Bit: Not Critical
        Payload length: 8
        Protocol ID: RESERVED (0)
        SPI Size: 0
        Notify Message Type: MULTIPLE_AUTH_SUPPORTED (16404)
        Notification DATA: <MISSING>


Figure 18-23 The completion of the IKE_SA_INIT exchange includes the responder's SPI (labeled 
"cookie"), a single proposal with transforms, DH parameters, a nonce value, and NAT 
traversal address parameters. This message also includes a CERTREQ payload to 
indicate and request acceptable certificates, and a notification indicating that multiple 
authentication methods (in series) are supported. 

The final payload in Figure 18-23 is a Notify payload containing the 
MULTIPLE_AUTH_SUPPORTED indication and no associated data. Defined as an experimental 
extension to IKE in [RFC4739], it indicates the ability to use more than one 
authentication method. Such a situation may arise, for example, when using an 
IKE_AUTH exchange based on certificates to establish IKE SAs to a service provider, 
followed by some form of EAP-based authentication for the individual user. 

The remaining packets shown in Figure 18-23 contain IKE_AUTH messages 
that are encrypted. They are carried using source and destination port number 
4500 instead of 500, and the encapsulation uses the special "non-ESP marker" containing 
4 bytes of 0 [RFC3947], indicating that the traffic is IKE and not ESP. The 
marker and port numbers are also used for the INFORMATIONAL exchanges we 
discussed previously. 
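The role of the non-ESP marker can be illustrated with a small demultiplexer. This is a sketch under the assumption that a receiver on port 4500 only needs to distinguish IKE from ESP: an ESP packet begins with its 4-byte SPI, which is never zero for a real SA, so four zero bytes can safely signal IKE. The function name and sample payloads are hypothetical.

```python
from typing import Tuple

NON_ESP_MARKER = bytes(4)  # four zero bytes placed ahead of an IKE message


def demux_port_4500(udp_payload: bytes) -> Tuple[str, bytes]:
    """Classify a UDP port 4500 payload as IKE or UDP-encapsulated ESP."""
    if len(udp_payload) < 4:
        raise ValueError("payload too short to classify")
    if udp_payload[:4] == NON_ESP_MARKER:
        # Strip the marker; the IKE header follows immediately.
        return "IKE", udp_payload[4:]
    # Otherwise the first 4 bytes are the (nonzero) ESP SPI.
    return "ESP", udp_payload


print(demux_port_4500(bytes(4) + b"ike-message")[0])
print(demux_port_4500(bytes.fromhex("6cfca5ef") + b"esp-data")[0])
```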

Wireshark has the capability to decrypt encrypted IKE traffic if provided with 
the proper keys and SPI values. By providing a copy of the log trace file from the 
IKE server to Wireshark (located under Edit | Preferences | Protocols | ISAKMP), 
we can see the decrypted IKE payload information. (The Wireshark developers 
tend to prefer the original names of protocols such as ISAKMP and SSL instead of 
IKE and TLS, so that is what we see when looking at Wireshark output.) 

The third packet in Figure 18-22 is the first fragment of a UDP/IP datagram 
that Wireshark reassembles when it receives the second fragment (packet 4). The 
decrypted and reassembled result is shown in Figure 18-24. 

Here we can see the contents of the reassembled and decrypted UDP/IPv4 
fragments constituting the first packet of the IKE_AUTH exchange. The client 
provides the following IKE payloads: IDi, CERT, CERTREQ, AUTH, 
N(MOBIKE_SUPPORTED), CP, SA, TSi, and TSr. The IDi payload contains the name of the initiator, 
test client. The CERT payload contains a client certificate for test client signed by 
the Test CA certificate authority that we know the corresponding server should 
accept (because it was configured to). The CERTREQ payload contains requests 
for Test CA as well as 21 other CAs (not shown) known by this Windows 7 client. 
The AUTH payload contains a data block signed using the RSA private key 
of the initiator (see Section 2.15 of [RFC5996]), which provides origin authentication. 
The N(MOBIKE_SUPPORTED) indicates the client's willingness to follow 
the MOBIKE protocol. The CP(CFG_REQUEST) payload (not detailed) contains 
the following attributes: INTERNAL_IP4_ADDRESS, INTERNAL_IP4_DNS, 
INTERNAL_IP4_NBNS, and a PRIVATE_USE type (23456). These are used to 
help in configuring VPN access and serve a similar purpose to the configuration 
information typically provided locally by DHCP (see Chapter 6). NBNS refers to a 
NetBIOS name server. NetBIOS is an API that can be implemented on a number of 
networking protocols and is common in Microsoft Windows environments. 

The SA payload in Figure 18-24 represents the information required to form a 
CHILD_SA. There are two proposals (not detailed), each for ESP using 32-bit SPI 
values (note that IKE uses 64-bit SPI values) with AUTH_HMAC_SHA1_96 as the 
integrity algorithm and not using extended sequence numbers (indicated using 
a proposal transform). The first proposal suggests the use of ENCR_AES_CBC 
(256-bit keys) for encryption, and the second suggests ENCR_3DES. Because there 
is no N(USE_TRANSPORT_MODE) payload present, we conclude that each of the 
proposals involves using ESP in the default tunnel mode. 

Fragmented IP protocol (proto=UDP 0x11, off=0, ID=026e) [Reassembled in #4]
5    0.202014  10.0.0.3   10.0.1.48    ISAKMP    IKE_AUTH

Frame 4: 410 bytes on wire (3280 bits), 410 bytes captured (3280 bits)
Ethernet II, Src: 00:27:10:8e:a2:14 (00:27:10:8e:a2:14), Dst: 00:0b:db:bb:f3:64 (00:0b:db:bb:f3:64)
Internet Protocol Version 4, Src: 10.0.1.48 (10.0.1.48), Dst: 10.0.0.3 (10.0.0.3)
User Datagram Protocol, Src Port: 4500 (4500), Dst Port: 4500 (4500)
UDP Encapsulation of IPsec Packets
Internet Security Association and Key Management Protocol
    Initiator cookie: e9f321ebe19efa4e
    Responder cookie: 71cc31af621a3512
    Next payload: Encrypted and Authenticated (46)
    Version: 2.0
    Exchange type: IKE_AUTH (35)
    Flags: 0x08
    Message ID: 0x00000001
    Length: 1844
    Type Payload: Encrypted and Authenticated (46)
        Next payload: Identification - Initiator (35)
        0... .... = Critical Bit: Not Critical
        Payload length: 1816
        Initialization Vector: 289e871ca8981f65 (8 bytes)
        Encrypted Data (1792 bytes)
        Decrypted Data (1792 bytes)
            Contained Data (1786 bytes)
                Type Payload: Identification - Initiator (35)
                Type Payload: Certificate (37)
                    Next payload: Certificate Request (38)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 805
                    Certificate Encoding: X.509 Certificate - Signature (4)
                    Certificate Data (id-at-commonName=test client)
                Type Payload: Certificate Request (38)
                Type Payload: Authentication (39)
                    Next payload: Notify (41)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 264
                    Authentication Method: RSA Digital Signature (1)
                    Authentication Data: 55196721c63b107a380b48a67cf6b41f829c9eb9073f0dbf...
                Type Payload: Notify (41)
                Type Payload: Configuration (47)
                    Next payload: Security Association (33)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 24
                    Type: CFG_REQUEST (1)
                    Attribute Type: (t=1,l=0) INTERNAL_IP4_ADDRESS
                    Attribute Type: (t=3,l=0) INTERNAL_IP4_DNS
                    Attribute Type: (t=4,l=0) INTERNAL_IP4_NBNS
                    Attribute Type: (t=23456,l=0) PRIVATE USE
                Type Payload: Security Association (33)
                    Next payload: Traffic Selector - Initiator (44)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 80
                    Type Payload: Proposal (2) # 1
                    Type Payload: Proposal (2) # 2
                Type Payload: Traffic Selector - Initiator (44) # 2
                    Next payload: Traffic Selector - Responder (45)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 64
                    Number of Traffic Selector: 2
                    Traffic Selector Type: TS_IPV6_ADDR_RANGE (8)
                        Protocol ID: Unused
                        Selector Length: 40
                        Start Port: 0
                        End Port: 65535
                        Starting Addr: :: (::)
                        Ending Addr: ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff (ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff)
                    Traffic Selector Type: TS_IPV4_ADDR_RANGE (7)
                        Protocol ID: Unused
                        Selector Length: 16
                        Start Port: 0
                        End Port: 65535
                        Starting Addr: 0.0.0.0 (0.0.0.0)
                        Ending Addr: 255.255.255.255 (255.255.255.255)
                Type Payload: Traffic Selector - Responder (45) # 2
            Padding (5 bytes)
            Pad Length: 5
        Integrity Checksum Data: 35e21f9463059736c0599e89 (12 bytes) [correct]


Figure 18-24 The IKE_AUTH exchange contains encrypted information and operates on UDP port 4500. The 
reassembly of two fragments produces an IKE message with an Encrypted/Authenticated data 
payload containing the following payloads: Identification initiator (IDi), Certificate (CERT), 
Certificate Request (CERTREQ), Authentication (AUTH), Notify (N), Configuration (CP), Security 
Association (SA), Traffic Selector initiator (TSi), and Traffic Selector responder (TSr). 

The Traffic Selector (TSi and TSr) payloads in Figure 18-24 indicate the IPv4 
and IPv6 address ranges that are permitted to be associated with the forming SA. 
The TSi has both a TS_IPV6_ADDR_RANGE and TS_IPV4_ADDR_RANGE that 
contain their entire address and port number ranges. TSr (not detailed) contains 
the same values. 

The first IKE_AUTH message we just discussed is fairly complicated and 
requires more than a single 1500-byte UDP/IPv4 datagram to hold it. After processing 
by the responder, the final message in the exchange is produced. It is 
shown in Figure 18-25. 

In this figure, the server sends a response with the following payloads: IDr, 
CERT, AUTH, CP(CFG_REPLY), SA, TSi, TSr, N(AUTH_LIFETIME), 
N(MOBIKE_SUPPORTED), and N(NO_ADDITIONAL_ADDRESSES). The IDr payload contains 
a DER-encoded name of the server. The CERT payload contains the matching (server) 
certificate, and the AUTH payload indicates knowledge of the corresponding private 
key. The CP(CFG_REPLY) payload includes an INTERNAL_IP4_ADDRESS 
attribute, which is useful for VPN configuration. The SA payload is similar to the client's 
SA payload from Figure 18-24 and includes a single proposal with transforms 
ENCR_AES_CBC (256-bit keys), AUTH_HMAC_SHA1_96, and no ESNs. 

The TSi and TSr values in this packet have been "narrowed" to be much smaller 
ranges than in the client's IKE_AUTH message. In this case, the TSi is narrowed to 
the single IPv4 address 10.100.0.1. The TSr has been narrowed to 10.0.0.0/16. Each 
uses the full port range 0-65535. This is a relatively simple case of narrowing. In 
cases where more than one discontinuous subset of the range specified by the 
initiator is acceptable, an N(ADDITIONAL_TS_POSSIBLE) payload may be generated. 
Narrowing is used to achieve mutually agreeable address ranges for an SA. 
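The simple case of narrowing amounts to intersecting the initiator's proposed range with the range the responder's policy allows. The sketch below handles a single IPv4 selector only and is not a full implementation of the rules in [RFC5996]; `TrafficSelector`, `ts`, and `narrow` are hypothetical names, and the policy range is taken from the TSr shown in the trace.

```python
import ipaddress
from typing import NamedTuple, Optional


class TrafficSelector(NamedTuple):
    start_addr: ipaddress.IPv4Address
    end_addr: ipaddress.IPv4Address
    start_port: int
    end_port: int


def ts(start: str, end: str, start_port: int = 0,
       end_port: int = 65535) -> TrafficSelector:
    """Convenience constructor taking dotted-quad strings."""
    return TrafficSelector(ipaddress.IPv4Address(start),
                           ipaddress.IPv4Address(end), start_port, end_port)


def narrow(proposed: TrafficSelector,
           policy: TrafficSelector) -> Optional[TrafficSelector]:
    """Return the overlap of two selectors, or None if they are disjoint."""
    start_addr = max(proposed.start_addr, policy.start_addr)
    end_addr = min(proposed.end_addr, policy.end_addr)
    start_port = max(proposed.start_port, policy.start_port)
    end_port = min(proposed.end_port, policy.end_port)
    if start_addr > end_addr or start_port > end_port:
        return None
    return TrafficSelector(start_addr, end_addr, start_port, end_port)


# The client proposes everything; the policy permits only 10.0.0.0/16.
proposed = ts("0.0.0.0", "255.255.255.255")
policy = ts("10.0.0.0", "10.0.255.255")
narrowed = narrow(proposed, policy)
print(narrowed.start_addr, narrowed.end_addr, narrowed.start_port, narrowed.end_port)
```

When the acceptable set is not one contiguous range, a single intersection like this is not enough, which is the situation the N(ADDITIONAL_TS_POSSIBLE) payload addresses.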

The N(AUTH_LIFETIME) payload indicates that the authentication is going 
to last at most only 2.8 hours (10,154s, expressed as 000027aa in the trace). The 
N(MOBIKE_SUPPORTED) payload indicates the responder's support for MOBIKE. 
The N(NO_ADDITIONAL_ADDRESSES) payload (not detailed) is used with 
MOBIKE to indicate that no additional IP addresses other than those used in the 
exchange are being used. 
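The lifetime arithmetic can be checked with a short calculation. This sketch assumes the notification data is a 4-byte unsigned integer in network byte order giving the lifetime in seconds; the function name is illustrative.

```python
def auth_lifetime_seconds(notification_data: bytes) -> int:
    """Interpret AUTH_LIFETIME notification data as a big-endian second count."""
    if len(notification_data) != 4:
        raise ValueError("AUTH_LIFETIME data must be exactly 4 bytes")
    return int.from_bytes(notification_data, "big")


seconds = auth_lifetime_seconds(bytes.fromhex("000027aa"))
print(seconds, round(seconds / 3600, 2))  # 10154 2.82
```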

At this point, a tunnel mode ESP CHILD_SA has been set up and traffic can 
flow. We do not detail the traffic flow containing ESP packets (they are comparatively 
straightforward) but instead jump to the point where the SAs are to be torn 
down. This is accomplished using two sets of INFORMATIONAL exchanges containing 
Delete payloads: one for the ESP SA and one for the IKE SA. Figure 18-26 
shows the request to close the ESP SA. 

Frame 5: 1402 bytes on wire (11216 bits), 1402 bytes captured (11216 bits)
Ethernet II, Src: 00:0b:db:bb:f3:64 (00:0b:db:bb:f3:64), Dst: 00:27:10:8e:a2:14 (00:27:10:8e:a2:14)
Internet Protocol Version 4, Src: 10.0.0.3 (10.0.0.3), Dst: 10.0.1.48 (10.0.1.48)
User Datagram Protocol, Src Port: 4500 (4500), Dst Port: 4500 (4500)
UDP Encapsulation of IPsec Packets
Internet Security Association and Key Management Protocol
    Initiator cookie: e9f321ebe19efa4e
    Responder cookie: 71cc31af621a3512
    Next payload: Encrypted and Authenticated (46)
    Version: 2.0
    Exchange type: IKE_AUTH (35)
    Flags: 0x20
    Message ID: 0x00000001
    Length: 1356
    Type Payload: Encrypted and Authenticated (46)
        Next payload: Identification - Responder (36)
        0... .... = Critical Bit: Not Critical
        Payload length: 1328
        Initialization Vector: 1e174cb2c36b16f5 (8 bytes)
        Encrypted Data (1304 bytes)
        Decrypted Data (1304 bytes)
            Contained Data (1295 bytes)
                Type Payload: Identification - Responder (36)
                Type Payload: Certificate (37)
                Type Payload: Authentication (39)
                Type Payload: Configuration (47)
                    Next payload: Security Association (33)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 16
                    Type: CFG_REPLY (2)
                    Attribute Type: (t=1,l=4) INTERNAL_IP4_ADDRESS
                        Type: INTERNAL_IP4_ADDRESS (1)
                        0... .... .... .... = Config Attribute Format: Type/Length/Value (TLV)
                        Length: 4
                        Value: 0a640001
                        INTERNAL_IP4_ADDRESS: 10.100.0.1 (10.100.0.1)
                Type Payload: Security Association (33)
                    Next payload: Traffic Selector - Initiator (44)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 44
                    Type Payload: Proposal (2) # 1
                Type Payload: Traffic Selector - Initiator (44) # 1
                    Next payload: Traffic Selector - Responder (45)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 24
                    Number of Traffic Selector: 1
                    Traffic Selector Type: TS_IPV4_ADDR_RANGE (7)
                        Protocol ID: Unused
                        Selector Length: 16
                        Start Port: 0
                        End Port: 65535
                        Starting Addr: 10.100.0.1 (10.100.0.1)
                        Ending Addr: 10.100.0.1 (10.100.0.1)
                Type Payload: Traffic Selector - Responder (45) # 1
                    Next payload: Notify (41)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 24
                    Number of Traffic Selector: 1
                    Traffic Selector Type: TS_IPV4_ADDR_RANGE (7)
                        Protocol ID: Unused
                        Selector Length: 16
                        Start Port: 0
                        End Port: 65535
                        Starting Addr: 10.0.0.0 (10.0.0.0)
                        Ending Addr: 10.0.255.255 (10.0.255.255)
                Type Payload: Notify (41)
                    Next payload: Notify (41)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 12
                    Protocol ID: RESERVED (0)
                    SPI Size: 0
                    Notify Message Type: AUTH_LIFETIME (16403)
                    Notification DATA: 000027aa
                Type Payload: Notify (41)
                    Next payload: Notify (41)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 8
                    Protocol ID: RESERVED (0)
                    SPI Size: 0
                    Notify Message Type: MOBIKE_SUPPORTED (16396)
                    Notification DATA: <MISSING>
                Type Payload: Notify (41)
            Padding (8 bytes)
            Pad Length: 8
        Integrity Checksum Data: 11867359f9bbe3df7add3fa7 (12 bytes) [correct]


Figure 18-25 Completing the IKE_AUTH exchange, the responder produces an Encrypted/Authenticated data 
payload containing the following payloads: Identification responder (IDr), CERT, AUTH, 
CP(CFG_REPLY), SA, narrowed TSi and TSr, along with N(AUTH_LIFETIME), N(MOBIKE_SUPPORTED), 
and N(NO_ADDITIONAL_ADDRESSES). The first CHILD_SA can now commence. 

No.  Time       Source     Destination  Protocol  Info
6    26.420009  10.0.1.48  10.0.0.3     ISAKMP    INFORMATIONAL
7    26.422427  10.0.0.3   10.0.1.48    ISAKMP    INFORMATIONAL
8    26.451483  10.0.1.48  10.0.0.3     ISAKMP    INFORMATIONAL
9    26.452638  10.0.0.3   10.0.1.48    ISAKMP    INFORMATIONAL

Frame 6: 114 bytes on wire (912 bits), 114 bytes captured (912 bits)
Ethernet II, Src: 00:27:10:8e:a2:14 (00:27:10:8e:a2:14), Dst: 00:0b:db:bb:f3:64
Internet Protocol Version 4, Src: 10.0.1.48 (10.0.1.48), Dst: 10.0.0.3 (10.0.0.3)
User Datagram Protocol, Src Port: 4500 (4500), Dst Port: 4500 (4500)
UDP Encapsulation of IPsec Packets
Internet Security Association and Key Management Protocol
    Initiator cookie: e9f321ebe19efa4e
    Responder cookie: 71cc31af621a3512
    Next payload: Encrypted and Authenticated (46)
    Version: 2.0
    Exchange type: INFORMATIONAL (37)
    Flags: 0x08
    Message ID: 0x00000002
    Length: 68
    Type Payload: Encrypted and Authenticated (46)
        Next payload: Delete (42)
        0... .... = Critical Bit: Not Critical
        Payload length: 40
        Initialization Vector: 6bf431884d90c50c (8 bytes)
        Encrypted Data (16 bytes)
        Decrypted Data (16 bytes)
            Contained Data (12 bytes)
                Type Payload: Delete (42)
                    Next payload: NONE / No Next Payload (0)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 12
                    Protocol ID: ESP (3)
                    SPI Size: 4
                    Port: 1
                    Delete SPI: 6cfca5ef
            Padding (3 bytes)
            Pad Length: 3
        Integrity Checksum Data: 5e01684aa1b104945c09936f (12 bytes) [correct]


Figure 18-26 A request to delete the child ESP SA with SPI 6cfca5ef is carried on the IKE SA. The 
Delete payload shows Port: 1, which is mislabeled by Wireshark. (It should be Number 
of SPIs: 1.) 


We can see in Figure 18-26 the SA being deleted based on a close request at 
the client. Like other IKE traffic, it includes an encrypted and authenticated payload. 
The encrypted payload in turn includes a single Delete payload. The Delete 
payload can indicate that more than one SPI is to be deleted, but in this case it 
indicates only the one with SPI value 0x6cfca5ef. Packet 7 from the responder is 
essentially the same but contains a different setting in the Flags field (responder 
instead of initiator and response instead of request), a different encryption IV 
and integrity checksum data, and specification of a different SPI (c348faf2) in the 
Delete payload. 
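The layout of the Delete payload in Figure 18-26 can be decoded by hand. The sketch below assumes the [RFC5996] layout: a 4-byte generic payload header followed by the protocol ID, SPI size, number of SPIs, and the SPIs themselves. The names `DeletePayload` and `parse_delete_payload` are illustrative, and the input bytes are rebuilt from the field values shown in the figure rather than copied from a capture.

```python
import struct
from typing import List, NamedTuple

PROTOCOL_IKE = 1
PROTOCOL_ESP = 3


class DeletePayload(NamedTuple):
    next_payload: int
    protocol_id: int
    spis: List[bytes]


def parse_delete_payload(data: bytes) -> DeletePayload:
    """Decode a Delete payload: generic header, protocol, SPI size, count, SPIs."""
    (next_payload, _critical, length,
     protocol_id, spi_size, num_spis) = struct.unpack_from("!BBHBBH", data)
    body = data[8:length]
    if len(body) != spi_size * num_spis:
        raise ValueError("SPI list does not match the declared size and count")
    spis = [body[i * spi_size:(i + 1) * spi_size] for i in range(num_spis)]
    return DeletePayload(next_payload, protocol_id, spis)


# Payload length 12, protocol ESP (3), SPI size 4, one SPI: 6cfca5ef.
esp_delete = bytes.fromhex("0000000c" "03" "04" "0001" "6cfca5ef")
parsed = parse_delete_payload(esp_delete)
print(parsed.protocol_id, [spi.hex() for spi in parsed.spis])
```

The third 16-bit field decoded here is the count that Wireshark labels "Port" in the figure.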

To close the IKE_SA, another exchange of INFORMATIONAL messages is 
required. The initiator begins with the packet shown in Figure 18-27. We can see 
here a request to close the IKE SA. Encrypted like other traffic, the Delete payload 
does not need to include an SPI value because it is implied to be the IKE SA carrying 
the deletion request. To complete the IKE SA deletion, the responder replies 
with an IKE message containing only an empty encrypted/authenticated payload 
type in packet 9. Its Next Payload type field is NONE (zero). This indicates the 
completion of the IKE SA deletion. 

No.  Time       Source     Destination  Protocol  Info
6    26.420009  10.0.1.48  10.0.0.3     ISAKMP    INFORMATIONAL
7    26.422427  10.0.0.3   10.0.1.48    ISAKMP    INFORMATIONAL
8    26.451483  10.0.1.48  10.0.0.3     ISAKMP    INFORMATIONAL
9    26.452638  10.0.0.3   10.0.1.48    ISAKMP    INFORMATIONAL

Frame 8: 114 bytes on wire (912 bits), 114 bytes captured (912 bits)
Ethernet II, Src: 00:27:10:8e:a2:14 (00:27:10:8e:a2:14), Dst: 00:0b:db:bb:f3:64
Internet Protocol Version 4, Src: 10.0.1.48 (10.0.1.48), Dst: 10.0.0.3 (10.0.0.3)
User Datagram Protocol, Src Port: 4500 (4500), Dst Port: 4500 (4500)
UDP Encapsulation of IPsec Packets
Internet Security Association and Key Management Protocol
    Initiator cookie: e9f321ebe19efa4e
    Responder cookie: 71cc31af621a3512
    Next payload: Encrypted and Authenticated (46)
    Version: 2.0
    Exchange type: INFORMATIONAL (37)
    Flags: 0x08
    Message ID: 0x00000003
    Length: 68
    Type Payload: Encrypted and Authenticated (46)
        Next payload: Delete (42)
        0... .... = Critical Bit: Not Critical
        Payload length: 40
        Initialization Vector: 86ebf4d5838b3dcd (8 bytes)
        Encrypted Data (16 bytes)
        Decrypted Data (16 bytes)
            Contained Data (8 bytes)
                Type Payload: Delete (42)
                    Next payload: NONE / No Next Payload (0)
                    0... .... = Critical Bit: Not Critical
                    Payload length: 8
                    Protocol ID: IKE (1)
                    SPI Size: 0
                    Port: 0
            Padding (7 bytes)
            Pad Length: 7
        Integrity Checksum Data: 3cdcf499606184458ef7f5a7 (12 bytes) [correct]


Figure 18-27 A request to delete the IKE SA. SPI values are not required because the entire message 
is carried on the IKE SA and there is no ambiguity. 


18.9 Transport Layer Security (TLS and DTLS) 

So far we have discussed security protocols at layers 2 and 3. The most widely 
used protocol for security operates just above the transport layer and is called 
Transport Layer Security (TLS). TLS is used for securing Web communications and 
for several other popular protocols, including POP and IMAP (which are called 
POP3S and IMAPS, respectively, when protected with TLS). One reason for TLS's 
popularity is that it can be implemented within or underneath applications that 
ride on top of the lower layers, whereas protocols such as EAP and IPsec usually 
require capabilities within the operating systems and protocol implementations of 
hosts and embedded devices. 

There are several versions of TLS and its predecessor, the Secure Sockets Layer 
(SSL) [RFC6101]. We shall focus on TLS version 1.2 [RFC5246], which is the most 
recent at the time of writing. TLS 1.2 can support backward compatibility with 
most older versions of TLS and SSL (e.g., TLS 1.0, 1.1, and SSL 3.0). However, SSL 
2.0 is weaker, and while interoperability with it is possible, it is now prohibited 
[RFC6176]. After discussing TLS 1.2, which operates over a stream-oriented protocol 
(usually TCP), we will look at the datagram-oriented variant called the Datagram 
Transport Layer Security (DTLS) [RFC4347]. DTLS is slowly gaining popularity for 
some applications such as VPN implementations that do not use IPsec. Its current 
specification is based on TLS 1.1 [RFC4346], but updates are under way [IDDTLS]. 

18.9.1 TLS 1.2 

The security goals of TLS are not unlike those for IPsec, but TLS operates at a 
higher layer. Confidentiality and data integrity are provided based on a variety of 
cryptographic suites that use certificates that can be provided by a PKI. TLS can 
also establish secure connections between two anonymous parties (without using 
certificates), but this application is vulnerable to a MITM attack (not surprising, 
given that each end is not even strongly identified). The TLS protocol has two 
layers of its own, called the record layer and the upper layer. The Record protocol 
implements the record (lower) layer and is assumed to be layered on a reliable 
underlying protocol (e.g., TCP). Figure 18-28 shows the basic organization. 

Figure 18-28 The TLS protocol "stack" has a lower record layer and three of its own upper-layer protocols 
called handshaking protocols. A fourth upper-layer protocol is the application protocol using 
TLS. The record layer provides fragmentation, compression, integrity protection, and encryption. 
The handshaking protocols perform many of the same tasks for TLS that IKE does for IPsec. 


TLS is a client/server protocol, designed to support security for a connection 
between two applications. The Record protocol provides fragmentation, compression, 
integrity protection, and encryption for data objects exchanged between 
clients and servers, and the handshake protocols establish identities, perform 
authentication, indicate alerts, and provide unique key material for the Record 
protocol to use on each connection. The handshaking protocols comprise four specific 
protocols: the Handshake protocol, the Alert protocol, the Change Cipher 
Spec protocol, and the application data protocol. Like IPsec, TLS is extensible and 
can accommodate existing or future cryptographic suites, which TLS calls cipher 
suites (CS). Many such combinations have been defined, and the IANA maintains 
a registry of the current set [TLSPARAMS]. Modern variants of TLS are based 
on SSL 3.0, originally developed by Netscape. TLS and SSL do not directly interoperate, 
but there are negotiation mechanisms that allow clients and servers to 
dynamically discover which protocol to use when a connection is first established. 

The Change Cipher Spec protocol is used to change the current operating 
parameters. This is accomplished by first using the Handshake protocol to set up 
a "pending" state, followed by an indication to switch from the current state to the 
pending state (which then becomes the current state). Such switching is allowed 
only after the pending state has been readied. TLS depends on five cryptographic 
operations: digital signing, stream cipher encryption, block cipher encryption, 
AEAD, and public key encryption. For integrity protection, the TLS record layer 
uses HMAC. For key generation, TLS 1.2 uses a PRF based on HMAC with SHA-256. 
TLS also integrates an optional compression algorithm that is negotiated 
when a connection is first established. 
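The HMAC-based PRF mentioned above can be written compactly. The following is a minimal sketch of the data expansion function defined in [RFC5246], where A(0) is the seed, A(i) = HMAC(secret, A(i-1)), and each output block is HMAC(secret, A(i) + seed). The secret and random values in the example are placeholders, not values from any real handshake.

```python
import hashlib
import hmac


def p_sha256(secret: bytes, seed: bytes, length: int) -> bytes:
    """P_hash expansion with HMAC-SHA256, truncated to `length` bytes."""
    output = b""
    a = seed  # A(0)
    while len(output) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()  # A(i)
        output += hmac.new(secret, a + seed, hashlib.sha256).digest()
    return output[:length]


def tls12_prf(secret: bytes, label: bytes, seed: bytes, length: int) -> bytes:
    """PRF(secret, label, seed) = P_SHA256(secret, label + seed)."""
    return p_sha256(secret, label + seed, length)


# Placeholder inputs: a premaster secret and the two 32-byte hello randoms.
pre_master_secret = bytes(48)
client_random = bytes([0x01]) * 32
server_random = bytes([0x02]) * 32

# The 48-byte master secret is derived with the label "master secret".
master_secret = tls12_prf(pre_master_secret, b"master secret",
                          client_random + server_random, 48)
print(len(master_secret))
```

Because the output is generated block by block, any requested length is a prefix of a longer output from the same inputs.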

18.9.1.1 TLS Record Protocol 

The Record protocol uses an extensible set of record content type values to identify 
which message type (i.e., which of the higher-layer protocols) is being multiplexed. 
At any given point in time, the Record protocol has an active current 
connection state and another set of state parameters called the pending connection 
state. Each connection state is further divided into a read state and a write state. 
Each of these states specifies a compression algorithm, encryption algorithm, and 
MAC algorithm to be used for communication, along with any necessary keys 
and parameters. When a key is changed, the pending state is first set up using 
the Handshake protocol, and then a synchronization operation (usually accomplished 
using the Change Cipher Spec protocol) sets the current state equal to the pending 
state. When first initialized, all states are set up with NULL encryption, no 
compression, and no MAC processing. 

The Record protocol's processing flow is shown in Figure 18-29. It divides
(fragments) higher-layer information blocks into records called TLSPlaintext
records, which can be at most 2^14 bytes in length (but are usually much less). The
choice of record size resides within TLS; higher-layer message boundaries are
not preserved. Once formed, TLSPlaintext records are compressed using a
compression algorithm [RFC3749] identified in the current connection state. There is
always one compression protocol active, although it may be (and usually is) the
NULL compression protocol (which, not surprisingly, provides no compression
gain). The compression algorithm converts a TLSPlaintext record into a
TLSCompressed structure. Compression algorithms are required to be lossless and may not
produce an output that is larger than the input by more than 1KB. To protect the
payload from disclosure and modification, encryption and integrity protection
algorithms convert a TLSCompressed structure into a TLSCiphertext structure,
which is then sent on the underlying transport connection.

Figure 18-29 The TLS record layer starts with a TLSPlaintext record, which is compressed by a lossless
compression algorithm to form a TLSCompressed record. The TLSCompressed record is encrypted
(and has a MAC applied) to form a TLSCiphertext record, which is sent for transmission.
Conventional stream and block ciphers require a MAC, and block ciphers may include padding.
When using AEAD ciphers, a nonce is included with the encrypted and integrity-protected
content, but no separate MAC is used.
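
The fragmentation step can be sketched in a few lines. The following is a
minimal illustration, not an implementation of the full record layer: it
performs no compression or protection, and it assumes the 5-byte record header
of [RFC5246] (content type, two version bytes, 16-bit length). The function
name fragment and the constants are ours; content type 23 is the application
data type.

```python
import struct

MAX_FRAGMENT = 2 ** 14          # largest TLSPlaintext fragment, in bytes
CONTENT_APPLICATION_DATA = 23   # record content type for application data
TLS12 = (3, 3)                  # TLS 1.2 appears on the wire as version 3.3

def fragment(data, content_type=CONTENT_APPLICATION_DATA, version=TLS12):
    """Split data into TLSPlaintext records (5-byte header plus fragment)."""
    records = []
    for off in range(0, len(data), MAX_FRAGMENT):
        chunk = data[off:off + MAX_FRAGMENT]
        header = struct.pack("!BBBH", content_type,
                             version[0], version[1], len(chunk))
        records.append(header + chunk)
    return records

records = fragment(b"x" * 40000)
print([len(r) for r in records])
```

Note that the record boundaries fall wherever the 2^14-byte limit dictates;
nothing in the output reflects how the application originally wrote the data.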


Referring to Figure 18-29, when producing a TLSCiphertext structure, a 
sequence number is first computed (but not placed in the message), then a MAC 
is computed if necessary, and finally symmetric encryption is performed. Prior 
to encryption, the message may be padded (up to 255 bytes) to meet any block 
length requirements imposed by the encryption algorithm (e.g., for block ciphers). 
A MAC is not required for AEAD algorithms that provide both integrity and 
encryption (e.g., CCM, GCM), but a nonce is used in such cases. 
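
The MAC and padding steps for a conventional block cipher can be sketched as
follows. This is an illustration under stated assumptions rather than a
complete implementation: the MAC input layout (64-bit sequence number, then
type, version, length, and fragment) follows [RFC5246], HMAC-SHA256 stands in
for whatever MAC the cipher suite selects, and the encryption step itself is
omitted because Python's standard library has no block cipher. The function
names are hypothetical.

```python
import hashlib
import hmac
import struct

def record_mac(mac_key, seq_num, content_type, version, fragment):
    """HMAC over the implicit sequence number, record header, and fragment."""
    header = struct.pack("!QBBBH", seq_num, content_type,
                         version[0], version[1], len(fragment))
    return hmac.new(mac_key, header + fragment, hashlib.sha256).digest()

def cbc_pad(data, block_size=16):
    """Append the shortest TLS padding: pad_len copies of pad_len, plus the
    padding-length byte itself, so the total is a block multiple."""
    pad_len = -(len(data) + 1) % block_size
    return data + bytes([pad_len]) * (pad_len + 1)

mac = record_mac(b"k" * 32, 0, 23, (3, 3), b"hello")
to_encrypt = cbc_pad(b"hello" + mac)    # this is what the block cipher sees
print(len(mac), len(to_encrypt))
```

Because the sequence number is an input to the MAC but is never transmitted,
a record that is replayed, dropped, or reordered fails verification at the
receiver.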

Keys for the Record protocol are derived from a master secret provided by some
method outside the Record protocol, most often by the Handshake protocol. Using
the master secret, along with random values provided by the client and server
applications at the beginning of the connection, the following keys are generated:

M_c | M_s | D_c | D_s | IV_c | IV_s = PRF(master_secret, "key expansion",
                                          server_random + client_random)

In this assignment, | is the splitting operator and + is the concatenation
operator. M_c denotes the MAC write key for the client, M_s denotes the MAC write key for
the server, D_c denotes the client's data write key, D_s denotes the server's data write
key, IV_c denotes the client's IV, and IV_s denotes the server's IV. With the |
operator, each key uses however many bytes of PRF output are required. MAC,
encryption, and IV keys, if used, have a fixed length based on the cipher suite
selected. The last two values are used only in cases where implicit nonce
generation takes place with AEAD ciphers (see Section 3.2.1 of [RFC5116]). According to
[RFC5246], the cipher suite requiring the most material is AES_256_CBC_SHA256.
It requires four 32-byte keys, for a total of 128 bytes.
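
The key expansion can be sketched as follows, assuming the TLS 1.2 PRF of
[RFC5246], which applies the P_hash construction with HMAC-SHA256 to the
concatenation of the label and the seed. The helper names p_sha256, prf, and
key_block are ours, and the all-zero master secret in the usage lines is a
placeholder, not a real value.

```python
import hashlib
import hmac

def p_sha256(secret, seed, length):
    """P_hash from RFC 5246: chained HMAC-SHA256 output, cut to length bytes."""
    out = b""
    a = seed                                              # A(0) = seed
    while len(out) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()  # A(i)
        out += hmac.new(secret, a + seed, hashlib.sha256).digest()
    return out[:length]

def prf(secret, label, seed, length):
    """TLS 1.2 PRF: P_SHA256(secret, label + seed)."""
    return p_sha256(secret, label + seed, length)

def key_block(master_secret, server_random, client_random,
              mac_len, key_len, iv_len):
    """Return [M_c, M_s, D_c, D_s, IV_c, IV_s], split in that order."""
    sizes = [mac_len, mac_len, key_len, key_len, iv_len, iv_len]
    block = prf(master_secret, b"key expansion",
                server_random + client_random, sum(sizes))
    keys, off = [], 0
    for n in sizes:
        keys.append(block[off:off + n])
        off += n
    return keys

# AES_256_CBC_SHA256: 32-byte MAC keys, 32-byte cipher keys, no IVs from the PRF
keys = key_block(b"\x00" * 48, b"s" * 32, b"c" * 32, 32, 32, 0)
print([len(k) for k in keys])
```

With these sizes the PRF must produce 128 bytes, matching the total given
above.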

18.9.1.2 TLS Handshaking Protocols 

There are three subprotocols to TLS, which perform tasks roughly equivalent
to those performed by IKE in IPsec. More specifically, these other protocols are
identified by numbers used for multiplexing and demultiplexing by the record
layer and are called the Handshake protocol (22), Alert protocol (21), and Change
Cipher Spec protocol (20). The Change Cipher Spec protocol is very simple. It consists
of one message containing a single byte that has the value 1. The purpose of the message
is to indicate to the peer a desire to change from the current to the pending state.
Receiving such a message moves the read pending state to the current state and
causes an indication to the record layer to transition to the pending write state as
soon as possible. This message is used by both client and server.
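
Because the message is so small, its complete on-the-wire form is easy to
write down. The sketch below builds the record (assuming the 5-byte record
header of [RFC5246] and TLS 1.2's version bytes 3.3) and models the
pending-to-current switch for one direction. The class Direction and its
string-valued states are hypothetical simplifications; a real implementation
holds keys and algorithm parameters.

```python
import struct

CONTENT_CHANGE_CIPHER_SPEC = 20

def change_cipher_spec_record(version=(3, 3)):
    """Record carrying the one-byte message with value 1."""
    payload = b"\x01"
    header = struct.pack("!BBBH", CONTENT_CHANGE_CIPHER_SPEC,
                         version[0], version[1], len(payload))
    return header + payload

class Direction:
    """Current and pending parameters for one direction (read or write)."""
    def __init__(self):
        self.current = "NULL"    # initial state: no encryption, no MAC
        self.pending = None      # filled in by the Handshake protocol

    def activate_pending(self):
        """Make the pending state current; allowed only once it is ready."""
        if self.pending is None:
            raise ValueError("pending state has not been readied")
        self.current, self.pending = self.pending, None

print(change_cipher_spec_record().hex())   # 140303000101
```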

The Alert protocol is used to deliver status information from one end of a
TLS connection to another. This can include terminating conditions (either fatal
errors or controlled shutdowns) or nonfatal error conditions. As of the
publication of [RFC5246], 24 alert messages were defined in standards. More than half of
them are always fatal (e.g., bad MACs, missing or unknown messages, algorithm
failures).

The Handshake protocol sets up the relevant connection operating parameters.
It allows the TLS endpoints to achieve six major objectives: agree on algorithms
and exchange random values used in forming symmetric encryption keys,
establish algorithm operating parameters, exchange certificates and perform mutual
authentication, generate a session-specific secret, provide security parameters to
the record layer, and verify that all of these operations have executed properly.
Figure 18-30 shows the messages required.

The handshake shown in Figure 18-30 begins with Hello messages. The
ClientHello message is usually the first message sent from client to server. It
contains a session ID, proposals for the cryptographic suite number (CS in Figure
18-30), and a set of acceptable compression algorithms (which are usually just
NULL, although [RFC3749] also defines DEFLATE). TLS supports in excess of 250
cipher suite options [TLSPARAMS].

The ClientHello message also contains the TLS version number and a random
number called ClientHello.random. Upon receiving the ClientHello message,
the server checks to see if the session ID is present in its cache. If so, the server
may agree to continue a previously existing connection (called a "resume") by
performing an abbreviated handshake. The abbreviated handshake is key to TLS
performance and avoids having to repeatedly verify the authenticity of each
endpoint, but it does require synchronization with respect to the cipher specification.
The ServerHello message completes the first part of the exchange by carrying the
server's random number (ServerHello.random) to the client. This message also
contains a session ID value. If the value is the same as that provided by the client, it
indicates the server's willingness to resume. If not, it carries a different value
identifying a new session (or is empty if the session will not be cached), and a full
handshake is required.

Figure 18-30 The normal TLS connection initiation exchange consists of several messages that
may be pipelined. Required messages have solid arrows and are shown in boldface
type. An abbreviated exchange takes place if a previously existing connection can be
restarted. This avoids endpoint authentication, which can be costly for systems with
limited processing capabilities.

If a full (nonabbreviated) handshake is executed, the exchange of Hello
messages results in each end becoming aware of the cipher suites, compression
algorithms, and random values of its peer. The server selects among the cryptographic
suites specified by the client and may be required to provide its certificate chain in
a Certificate message if it is to be authenticated (which is the typical case for secure
Web traffic or HTTPS). The server may also send a ServerKeyExchange message if
its certificate is not valid for signing, it has no certificate, or a temporary or
ephemeral key is to be used to generate session keys.


Note 

The ServerKeyExchange message is used only in cases where the Certificate
(server) message does not contain enough information to establish a premaster
secret. Such cases include anonymous or ephemeral DH key agreement (i.e.,
cipher suites starting with TLS_DH_anon, TLS_DHE_DSS, TLS_DHE_RSA).
The ServerKeyExchange message is not used for other suites, including those
starting with TLS_RSA, TLS_DH_DSS, or TLS_DH_RSA.
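
The rule in this note amounts to a prefix test on the cipher suite name. The
following sketch covers only the key exchange methods named in the note; the
anonymous suites are matched using the TLS_DH_anon prefix under which they
are registered in [RFC5246]. The function name is hypothetical, and suites
defined elsewhere (e.g., the elliptic curve suites) are outside its scope.

```python
# Key exchange methods for which the server's certificate alone cannot
# establish the premaster secret (limited to those named in the note).
SKE_PREFIXES = ("TLS_DHE_DSS_", "TLS_DHE_RSA_", "TLS_DH_anon_")

def needs_server_key_exchange(suite_name):
    """True if a ServerKeyExchange message must accompany this suite."""
    return suite_name.startswith(SKE_PREFIXES)

for name in ("TLS_DHE_RSA_WITH_AES_256_CBC_SHA256",
             "TLS_RSA_WITH_AES_256_CBC_SHA256"):
    print(name, needs_server_key_exchange(name))
```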


At this point, the server may require client authentication. If so, it generates
a CertificateRequest message. Once this message is sent, the server completes the
second portion of the exchange by sending the mandatory ServerHelloDone
message. Upon receiving this (possibly pipelined) message from the server, the client
may be required to prove its identity (i.e., knowledge of an appropriate private
key corresponding to a certificate). If so, it first sends its certificate using a
Certificate message in the same format used by the server. It then sends the
mandatory ClientKeyExchange message. The contents of this message depend on the
cryptographic suite used, but it generally contains either an RSA-encrypted key or
Diffie-Hellman parameters that may be used to create a type of seed for creating
new keys (called the premaster secret). Finally, it sends a CertificateVerify message
to demonstrate that it possesses the private key corresponding to the previously
provided certificate, if the server requested client authentication. This message
contains a signature on the hash of all of the handshake messages the client has
received and sent up to this point.
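
The ordering described so far can be summarized in code. The sketch below
lists the Handshake protocol messages of a full handshake in the order they
are sent, using the handshake type numbers from [RFC5246]. The function
full_handshake and its three flags are hypothetical simplifications of the
negotiation outcome; the ChangeCipherSpec messages are not Handshake protocol
messages and so appear only as a comment.

```python
from enum import IntEnum

class HandshakeType(IntEnum):
    HELLO_REQUEST = 0
    CLIENT_HELLO = 1
    SERVER_HELLO = 2
    CERTIFICATE = 11
    SERVER_KEY_EXCHANGE = 12
    CERTIFICATE_REQUEST = 13
    SERVER_HELLO_DONE = 14
    CERTIFICATE_VERIFY = 15
    CLIENT_KEY_EXCHANGE = 16
    FINISHED = 20

def full_handshake(server_cert=True, server_key_exchange=False,
                   client_auth=False):
    """Return (sender, message) pairs for a full handshake, in order."""
    H = HandshakeType
    flow = [("client", H.CLIENT_HELLO), ("server", H.SERVER_HELLO)]
    if server_cert:
        flow.append(("server", H.CERTIFICATE))
    if server_key_exchange:
        flow.append(("server", H.SERVER_KEY_EXCHANGE))
    if client_auth:
        flow.append(("server", H.CERTIFICATE_REQUEST))
    flow.append(("server", H.SERVER_HELLO_DONE))
    if client_auth:
        flow.append(("client", H.CERTIFICATE))
    flow.append(("client", H.CLIENT_KEY_EXCHANGE))
    if client_auth:
        flow.append(("client", H.CERTIFICATE_VERIFY))
    # Each side sends ChangeCipherSpec (content type 20) before its Finished.
    flow.append(("client", H.FINISHED))
    flow.append(("server", H.FINISHED))
    return flow

for sender, msg in full_handshake(server_key_exchange=True, client_auth=True):
    print(f"{sender:6s} {msg.name} ({msg.value})")
```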

The final portion of the exchange includes a ChangeCipherSpec message,
which is an independent TLS protocol content type (i.e., technically not a
Handshake protocol message). However, the mandatory Handshake protocol Finished
messages can be exchanged only after a successful exchange of ChangeCipherSpec
messages. The Finished messages are the first ones to be protected using the
parameters exchanged up to this point. The Finished messages themselves contain
"verify data," which consists of the following value:

verify_data = PRF(master_secret, finished_label, Hash(handshake_messages))

where finished_label has the value "client finished" for the client and
"server finished" for the server. The particular hash function Hash is
associated with the selection of the PRF made during the initial Hello exchange. TLS 1.2
provides the ability to have variable-length verify data, but all previous versions
and current cipher suites produce 12 bytes of verify data. The 48-byte
master_secret value is computed as follows:

master_secret = PRF(premaster_secret, "master secret",
                    ClientHello.random + ServerHello.random)

where + is the concatenation operator. The Finished message is important because
it can be used to know with a high degree of certainty that the Handshake
protocol has completed successfully and subsequent data exchange can take place.
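
Both computations are direct applications of the PRF. The sketch below
assumes the TLS 1.2 PRF of [RFC5246] (P_hash with HMAC-SHA256) and SHA-256 as
the Hash function, which is the case for suites that do not specify another
PRF. The function names and the placeholder inputs are ours.

```python
import hashlib
import hmac

def prf(secret, label, seed, length):
    """TLS 1.2 PRF: P_SHA256(secret, label + seed), cut to length bytes."""
    seed = label + seed
    out, a = b"", seed                                    # A(0) = label + seed
    while len(out) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()  # A(i)
        out += hmac.new(secret, a + seed, hashlib.sha256).digest()
    return out[:length]

def master_secret(premaster_secret, client_random, server_random):
    """48-byte master secret derived from the premaster secret."""
    return prf(premaster_secret, b"master secret",
               client_random + server_random, 48)

def verify_data(master, handshake_messages, sender):
    """12-byte verify data for the Finished message sent by sender."""
    label = b"client finished" if sender == "client" else b"server finished"
    transcript_hash = hashlib.sha256(b"".join(handshake_messages)).digest()
    return prf(master, label, transcript_hash, 12)

# Placeholder inputs; real values come from the key exchange and Hello messages.
ms = master_secret(b"\x03\x03" + b"\x00" * 46, b"c" * 32, b"s" * 32)
print(len(ms), verify_data(ms, [b"ClientHello...", b"ServerHello..."], "client").hex())
```

Because the hash covers every handshake message exchanged so far, an
alteration of any earlier message by an intermediary changes the verify data
and is detected when the Finished messages are checked.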

18.9.1.3 TLS Extensions 

If we compare the capabilities of IKE and TLS we have discussed so far, we can
see that IKE includes the ability to carry information beyond that required for
basic SA establishment. This is accomplished using IKE Notify and Configuration
payloads. To provide a similar extensible mechanism for TLS, various extensions
can be included with TLS 1.2 messages in a standard way. The baseline
specification for TLS 1.2 [RFC5246] includes a "signature algorithms" extension that a client
uses to specify to a server what types of hash and signature algorithms it supports
(MD5, SHA-1, SHA-224, SHA-256, SHA-384, SHA-512 for hashes and RSA, DSA,
ECDSA for digital signatures are defined). They are indicated in descending order
of preference by pairs, as some systems allow only certain combinations. The
current list of extensions is given in [TLSEXT].
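
As an illustration of how such pairs are carried, the following sketch encodes
a signature_algorithms extension. It assumes the code points assigned in
[RFC5246] (hashes: md5 = 1 through sha512 = 6; signatures: rsa = 1, dsa = 2,
ecdsa = 3) and the general extension layout of a 16-bit type followed by a
16-bit length. The function name is hypothetical.

```python
import struct

HASH = {"md5": 1, "sha1": 2, "sha224": 3, "sha256": 4, "sha384": 5, "sha512": 6}
SIG = {"rsa": 1, "dsa": 2, "ecdsa": 3}
EXT_SIGNATURE_ALGORITHMS = 0x000d

def signature_algorithms_extension(pairs):
    """Encode (hash, signature) name pairs, most preferred first."""
    body = b"".join(bytes([HASH[h], SIG[s]]) for h, s in pairs)
    data = struct.pack("!H", len(body)) + body       # list length, then pairs
    return struct.pack("!HH", EXT_SIGNATURE_ALGORITHMS, len(data)) + data

prefs = [("sha1", "rsa"), ("sha1", "dsa"), ("sha256", "rsa"),
         ("sha384", "rsa"), ("sha512", "rsa")]
print(signature_algorithms_extension(prefs).hex())
```

Five pairs occupy 10 bytes, and the 2-byte list length brings the extension
data to 12 bytes.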

Previous versions of TLS had about a half-dozen extensions, and [RFC6066]
updates these extensions for TLS 1.2. It defines the following extensions:
server_name (DNS-style name of the server being contacted), max_fragment_length
(maximum length of a message as 2^n bytes for n having values 9-12),
client_certificate_url (indicates support for the CertificateURL handshake message used to send
the URL of a certificate instead of a complete certificate), trusted_ca_keys (hashes
or the names of trusted CA public keys and/or certificates), truncated_hmac (use
the first 80 bits of HMAC calculations only), and status_request (requests that a
server invoke OCSP and provide the DER-encoded response in a CertificateStatus
handshake message to check a certificate). Each of these extensions may be
present in an (extended) ClientHello message and in some circumstances may appear
in the ServerHello message to indicate agreement. Aside from these extensions
and the two handshake messages already mentioned, [RFC6066] also defines
four alert messages: certificate_unobtainable, unrecognized_name,
bad_certificate_status_response, and bad_certificate_hash_value. These are self-explanatory
and are not sent unless the peer has demonstrated understanding of the extended
ClientHello type message.
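
Two of these extensions have particularly simple encodings, sketched below
under the assumption of the layouts given in [RFC6066]: server_name carries a
length-prefixed list of (type, length, name) entries, where type 0 denotes a
host name, and max_fragment_length carries a one-byte code from 1 to 4
standing for 2^9 through 2^12 bytes. The names server_name_data and
MAX_FRAGMENT_CODES are ours.

```python
import struct

def server_name_data(host):
    """Extension data for a server_name list with one host_name entry."""
    name = host.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name   # type 0 = host_name
    return struct.pack("!H", len(entry)) + entry

# One-byte codes carried by max_fragment_length: 1 -> 2^9, ..., 4 -> 2^12
MAX_FRAGMENT_CODES = {n - 8: 2 ** n for n in range(9, 13)}

data = server_name_data("127.0.0.1")
print(len(data), data.hex())
print(MAX_FRAGMENT_CODES)
```

For the nine-character name 127.0.0.1, the encoding occupies 14 bytes, which
agrees with the server_name length that appears in the ClientHello of Figure
18-32.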

Several other extensions have been defined or are reserved. The user_mapping
extension [RFC4681] provides a method for providing context for the user
identifier (e.g., Windows domain). Another expands the cert_type extension to
include not only X.509 certificates but also OpenPGP certificates [RFC6091].
Elliptic curve cipher suites are described by the informational document [RFC4492].
The Secure Remote Password protocol (SRP) can be integrated with TLS according to
the methods defined in the informational document [RFC5054]. A use_srtp
extension designed to produce a version of the Secure Real-Time protocol (SRTP) based on
DTLS (see Section 18.9.2) is given in [RFC5764]. A method to eliminate the state a
server must store to perform session resumption is given by the SessionTicket TLS
extension [RFC5077]. It involves placing the necessary state in an encrypted form
in the client. Finally, an important renegotiation_info extension is used to combat
a renegotiation vulnerability. We shall describe it in more detail next.

18.9.1.4 Renegotiation 

TLS supports the ability to renegotiate cryptographic connection parameters while 
maintaining the same connection. This can be initiated by either the server or the 
client. If the server wishes to renegotiate the connection parameters, it generates a 
HelloRequest message, and the client responds with a new ClientHello message, 
which begins the renegotiation procedure. The client is also able to generate such 
a ClientHello message spontaneously, without prompting from the server. 

Support for renegotiation is optional but "highly recommended" and is used, 
for example, when sequence numbers are about to wrap. Renegotiation can be 
refused by generating a "no_renegotiation" (type 100) warning alert. Although 
this type of alert is not required to be terminal, receiving such an alert may, by 
local policy, result in connection termination. 
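
An alert message is two bytes long: a level followed by a description. The
sketch below builds the record carrying a no_renegotiation warning, assuming
the alert level values of [RFC5246] (warning = 1, fatal = 2) and the 5-byte
record header. It shows the unprotected form; on an established connection
the record layer would encrypt and authenticate it like any other record.
The function name is hypothetical.

```python
import struct

CONTENT_ALERT = 21
LEVEL_WARNING, LEVEL_FATAL = 1, 2
NO_RENEGOTIATION = 100

def alert_record(level, description, version=(3, 3)):
    """Unprotected TLS record carrying a two-byte alert message."""
    body = bytes([level, description])
    header = struct.pack("!BBBH", CONTENT_ALERT,
                         version[0], version[1], len(body))
    return header + body

print(alert_record(LEVEL_WARNING, NO_RENEGOTIATION).hex())   # 15030300020164
```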

In 2009, a successful attack on TLS was demonstrated using the renegotiation
capability. We describe it in more detail in Section 18.12. The vulnerability allows
an attacker to establish a malicious TLS session with a server that can later be
spliced into a subsequent legitimate session by a client using a MITM attack. The
server believes that only a standards-compliant renegotiation has taken place. A
solution to the problem, given in [RFC5746], involves binding any renegotiation
more closely with the existing session using a TLS extension called
renegotiation_info (type 0xff01). When creating a new connection, renegotiation_info is empty.
When client renegotiation takes place, it contains "client_verify_data," and when
server renegotiation takes place it contains a concatenation of "client_verify_data"
and "server_verify_data." The client_verify_data is defined to be the same
verify_data used with the Finished message sent by the client on the completion of the
last handshake. This is a 12-byte value in TLS (36 for SSLv3). The
server_verify_data is defined to be the verify_data used with the Finished message sent by the
server on completion of the last handshake.
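
The three cases can be sketched with a single encoder. This assumes the
layout in [RFC5746], in which the extension data is a one-byte length followed
by the verify data. The function name and the placeholder verify data values
are ours.

```python
import struct

EXT_RENEGOTIATION_INFO = 0xff01

def renegotiation_info(client_verify_data=b"", server_verify_data=b""):
    """Encode the extension: empty on a new connection, client_verify_data
    when a client renegotiates, and both values when a server answers."""
    renegotiated_connection = client_verify_data + server_verify_data
    data = bytes([len(renegotiated_connection)]) + renegotiated_connection
    return struct.pack("!HH", EXT_RENEGOTIATION_INFO, len(data)) + data

initial = renegotiation_info()
client_renegotiation = renegotiation_info(b"C" * 12)          # placeholder data
server_renegotiation = renegotiation_info(b"C" * 12, b"S" * 12)
print(initial.hex(), len(client_renegotiation), len(server_renegotiation))
```

Even the "empty" form carries one byte of data (a zero length), which is why
a trace of an initial handshake shows the extension with a length of 1.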

Some deployed TLS (and SSL) servers abort a connection when unknown
extensions are present. To handle this issue when deploying the (relatively new)
renegotiation_info extension, an alternative is available. The TLS cipher suite
TLS_EMPTY_RENEGOTIATION_INFO_SCSV can be used during connection
establishment to indicate the equivalent of an empty renegotiation_info
extension. Here a signaling cipher suite value (SCSV) is used not to encode a real cipher
suite, but instead to indicate a certain set of functions. (A similar trick is used in
DNSSEC for NSEC3 records; see Section 18.10.1.3.)

18.9.1.5 Example 

In the example shown in Figure 18-31, we see the messages exchanged during a
connection setup with TLS 1.2 using TCP/IP on the local loopback interface. The
client and server have RSA certificates, which each provides to its peer. The initial
TCP handshake and window update, as well as the 127.0.0.1 source and
destination IPv4 addresses, are not shown. The trace has been annotated with right and
left arrows for additional clarity. The arrows pointing to the right indicate TCP
segments containing at least one TLS message sent by the client headed for the
server. Left-pointing arrows indicate messages from the server to the client. To see
this output, Wireshark was told to decode the trace by first choosing SSL under
the Analyze | Decode As ... menu.

Figure 18-31 A normal TLS 1.2 connection establishment as shown by Wireshark. The server runs
on port 5556. Client messages sent to the server are highlighted by arrows pointing to
the right. Server messages sent to the client are shown with left-pointing arrows. TCP
ACKs are interspersed with the TLS messages. After the Change Cipher Spec
message (segment 21), other messages are encrypted and authenticated. Segment 13 also
includes the ServerHelloDone message.


In Figure 18-31, after the initial TCP-level handshake, the TLS exchange begins
with a ClientHello message. TCP pure ACKs are seen interspersed with the TLS
messages. After the ChangeCipherSpec message has been processed, the
subsequent information is encrypted. To see what is happening in more detail, we shall
expand the first few TLS messages. Figure 18-32 shows the detailed contents of the
ClientHello message.

Compression Methods (1 method)
  Compression Method: null (0)
Extensions Length: 50
Extension: cert_type
  Type: cert_type (0x0009)
  Length: 3
  Data (3 bytes)
Extension: server_name
  Type: server_name (0x0000)
  Length: 14
  Data (14 bytes)
Extension: renegotiation_info
  Type: renegotiation_info (0xff01)
  Length: 1
  Data (1 byte)
Extension: SessionTicket TLS
  Type: SessionTicket TLS (0x0023)
  Length: 0
  Data (0 bytes)
Extension: signature_algorithms
  Type: signature_algorithms (0x000d)
  Length: 12
  Data (12 bytes)


Figure 18-32 A ClientHello message in TLS 1.2 contains version information, supported cipher
suites and compression algorithms, random data, and a number of extensions. Here,
the client supports Diffie-Hellman key agreement as well as key exchange using RSA.
It uses AES-256 in CBC mode for encryption and SHA-256 for integrity protection.

The ClientHello message detailed in Figure 18-32 is a Record protocol message
carrying the ClientHello handshake message. It contains a 32-bit UNIX timestamp
counting seconds since midnight, January 1, 1970, plus a random 28-byte value
(ClientHello.random) used in forming keys. As this is a brand-new connection, its
session ID is 0. Six bytes are devoted to carrying the client's three supported cipher
suites in preference order (most preferred first). Each suite is encoded using a
16-bit value specified by the TLS Cipher Suite Registry in [TLSPARAMS]. Only a
single compression method is supported: the NULL method, which achieves no
compression gain and is typical. Also, 50 bytes are included for extensions. The
cert_type extension indicates that either X.509 or OpenPGP certificates are
understood. The server_name extension contains 127.0.0.1, which was the name of the
server provided to the client application. The renegotiation_info is empty, as this
is the first handshake, as is the SessionTicket TLS extension. The
signature_algorithms extension indicates that the following combinations can be processed by
the client: sha1-rsa, sha1-dsa, sha256-rsa, sha384-rsa, and sha512-rsa.
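
The 32-byte random structure carried in each Hello message (a 4-byte
timestamp followed by 28 random bytes) can be built and decoded as follows.
This is a minimal sketch; the function names are ours, and os.urandom stands
in for whatever random source an implementation uses. The usage lines encode
19:16:34 UTC on December 9, 2010, which is the time Wireshark displays (as
11:16:34 Pacific Standard Time) for the server's random value in this trace.

```python
import os
import struct
from datetime import datetime, timezone

def make_random(now=None):
    """Build a Hello random value: 32-bit UNIX time plus 28 random bytes."""
    ts = int((now or datetime.now(timezone.utc)).timestamp())
    return struct.pack("!I", ts) + os.urandom(28)

def parse_random(blob):
    """Split a 32-byte random value into (UTC datetime, 28 random bytes)."""
    if len(blob) != 32:
        raise ValueError("random value must be 32 bytes")
    (ts,) = struct.unpack("!I", blob[:4])
    return datetime.fromtimestamp(ts, timezone.utc), blob[4:]

when = datetime(2010, 12, 9, 19, 16, 34, tzinfo=timezone.utc)
value = make_random(when)
print(parse_random(value)[0].isoformat())
```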

In this sample exchange, the server has been configured with only one cipher
suite, TLS_DHE_RSA_WITH_AES_256_CBC_SHA256 (0x006b). The server
indicates this fact when responding to the ClientHello by using the ServerHello
message shown in Figure 18-33.

In this figure, the server responds with a ServerHello message to the client's
ClientHello. The server provides its copy of the current time and its 28-byte
random value. It also includes a random 32-byte session ID. The server supports only
a single cipher suite (DH key agreement using RSA certificates with AES-256
encryption in CBC mode for encryption and SHA-256 for integrity protection).
Like the client, it does not support any compression methods. It includes an empty
renegotiation_info extension and an empty SessionTicket TLS extension.
Following this first message, the server continues with a Certificate message, as shown
in Figure 18-34.

The message in Figure 18-34 carries the server's 841-byte X.509v3 certificate to
the client, which has been signed by a sample certificate authority called Test CA
shown in the Issuer field. The field called subjectPublicKeyInfo contains the server's
270-byte public RSA key, which the client will use in authenticating the server.
There are six extensions in the certificate: basicConstraints (critical), subjectAltName
(contains a DNS name for the server using the certificate), extKeyUsage (extended
key usage, indicating that the purpose of the key is for authenticating a server),
keyUsage (critical; indicates that the enclosed key may be used for key
encipherment or for generating digital signatures), subjectKeyIdentifier (a 20-byte number
identifying the signed public key), and the authorityKeyIdentifier (a 20-byte number
identifying the key used by the certificate authority to produce this certificate).

The ClientKeyExchange message is not detailed as it mostly includes binary
information used in forming the DH exchange. The next message of interest is
segment 13, which is a single TCP segment containing both a CertificateRequest
message and a ServerHelloDone message. Figure 18-35 shows the contents.

Frame 7: 146 bytes on wire (1168 bits), 146 bytes captured (1168 bits)
Null/Loopback
Internet Protocol Version 4, Src: 127.0.0.1 (127.0.0.1), Dst: 127.0.0.1 (127.0.0.1)
Transmission Control Protocol, Src Port: 5556 (5556), Dst Port: 49710 (49710),
    Seq: 1, Ack: 107, Len: 90
  Source port: 5556 (5556)
  Destination port: 49710 (49710)
  [Stream index: 0]
  Sequence number: 1 (relative sequence number)
  [Next sequence number: 91 (relative sequence number)]
  Acknowledgement number: 107 (relative ack number)
  Header length: 32 bytes
  Flags: 0x18 (PSH, ACK)
  Window size value: 65535
  [Calculated window size: 524280]
  [Window size scaling factor: 8]
  Checksum: 0xe19b [correct]
  Options: (12 bytes)
  [SEQ/ACK analysis]
  [Timestamps]
Secure Sockets Layer
  TLSv1.2 Record Layer: Handshake Protocol: Server Hello
    Content Type: Handshake (22)
    Version: TLS 1.2 (0x0303)
    Length: 85
    Handshake Protocol: Server Hello
      Handshake Type: Server Hello (2)
      Length: 81
      Version: TLS 1.2 (0x0303)
      Random
        gmt_unix_time: Dec 9, 2010 11:16:34.000000000 Pacific Standard Time
        random_bytes: 5269c5f99ba898138ee784e093d717a472b3e912f943afd0...
      Session ID Length: 32
      Session ID: 8f3ea072cf9a5a061e65fd36fdbc3c7cbace8a8cfe738eb3...
      Cipher Suite: TLS_DHE_RSA_WITH_AES_256_CBC_SHA256 (0x006b)
      Compression Method: null (0)
      Extensions Length: 9
      Extension: renegotiation_info
      Extension: SessionTicket TLS

Figure 18-33 A ServerHello message in TLS 1.2 contains version information, supported cipher
suites and compression algorithms, and a number of extensions. Here, the client
supports Diffie-Hellman key agreement. It uses AES-256 for encryption and SHA-256 for
integrity protection.


Figure 18-35 shows a TCP segment containing both a CertificateRequest
message and a ServerHelloDone message. The CertificateRequest is requesting the
client to provide its certificate and to verify its authenticity using a subsequent
CertificateVerify message. The type of certificate requested should be signed using
either RSA or DSS from the Test CA certificate authority. The signature algorithms
listed are sha1-rsa, sha1-dsa, sha256-rsa, sha384-rsa, and sha512-rsa.
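
Separating several TLS messages that share one TCP segment requires only the
record and handshake headers. The sketch below walks a byte string of
unencrypted handshake records, assuming the 5-byte record header and the
4-byte handshake header (1-byte type, 3-byte length) of [RFC5246], and
assuming one handshake message per record as in this trace. The function and
helper names are hypothetical, and the message bodies in the usage lines are
zero-filled placeholders with the lengths seen in segment 13.

```python
import struct

def walk_handshake_records(segment):
    """Return (content_type, version, handshake_type, handshake_length)
    for each unencrypted handshake record found in segment."""
    found, off = [], 0
    while off + 5 <= len(segment):
        ctype, major, minor, length = struct.unpack_from("!BBBH", segment, off)
        body = segment[off + 5:off + 5 + length]
        if ctype == 22 and len(body) >= 4:      # 22 = Handshake protocol
            htype = body[0]
            hlen = int.from_bytes(body[1:4], "big")
            found.append((ctype, (major, minor), htype, hlen))
        off += 5 + length
    return found

def handshake_record(htype, body):
    """Helper for the example: wrap one handshake message in a TLS 1.2 record."""
    msg = bytes([htype]) + len(body).to_bytes(3, "big") + body
    return struct.pack("!BBBH", 22, 3, 3, len(msg)) + msg

# CertificateRequest (type 13, 39-byte body) then ServerHelloDone (type 14)
segment = handshake_record(13, b"\x00" * 39) + handshake_record(14, b"")
print(len(segment), walk_handshake_records(segment))
```

The two records total 57 bytes, which matches the difference between the
sequence number (1479) and next sequence number (1536) reported for the
segment.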

Packef 15 (nof defailed) confains fhe Cerfificafe message fhaf has fhe cerfificafe 
chain for fhe clienf and ifs public key. In fhis case, fhe subjecf field confains "fesf 
clienf" and fhe issuer is Tesf CA. Thus, fhe clienf's and server's cerfificafes were 
signed by fhe same CA and fhe chain is a single cerfificafe. For fhe clienf fo prove 
fhaf if possesses fhe corresponding privafe key, if generafes fhe CerfificafeVerify 
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Is Frame 9: 912 bytes o 

n wire (7296 bits), 912 bytes captured (7296 bits) 

1 

B Nu1l/Loopback 

B Internet Protocol ve 
S Transmission control 

rsion 4, src: 127.0.0.1 (127.0.0.1), DSt: 127.0.0.1 (127.0.0.1) 
protocol, src port: 5556 (5556), Dst port: 49710 (49710), seq: 91, Ack: 107 

Len: 356 


source port: 5556 (5556) 

Destination port: 49710 (49710) 

[stream Index: 0] 

sequence number: 91 (relative sequence number) 

[Next sequence number: 947 (relative sequence number)] 

Acknowledgement number: 107 (relative ack number) 

Header length: 32 b/tes 
B Flags: 0x18 (psh, ack) 
window size value: 65535 
[calculated window size: 524280] 

[window size scaling factor: 8] 

B checksum: 0xd6fa [correct] 
a options: (12 bytes) 
a [SE(Vack analysis] 
a [Timestamps] 

B Secure Sockets Layer 

QTLSvl.2 Record Layer: Handshake Protocol: Certificate 
Content Type: Handshake (22) 
version: TLS 1.2 (0x0303) 

Length: 351 

a Handshake Protocol: Certificate 
Handshake Type: certificate (11) 

Length: 847 

certificates Length: 844 
a certificates (344 bytes) 
certificate Length: 841 

a certificate (1d-at-commonName=localhost,Id-at-organIzatlonName-test org) 
a signedcertificate 
version: v3 (2) 
serial Number: 1291919218 
a signature (shawithRSAEncryption) 

Algorithm Id: 1.2.840.113549.1.1.5 (shawithRSAEncryption) 
a issuer: rdnsequence (0) 

a rdnsequence: 1 item (id-at-commonName=Test CA) 

a RDNsequence item: 1 item (id-at-commonName=Test ca) 

a RelativeoistinguishedName item (id-at-commonName=Test ca) 

Id: 2.5.4.3 (id-at-commonName) 
a Directorystring: printableString (1) 
printableString: Test CA 

a validity 

BnotBefore: utcTime (0) 

utcTime: 10-12-09 18:26:58 (UTC) 
anotAfter: utcTime (0) 

UtcTime: 11-12-09 18:26:59 (UTC) 
a subject: rdnsequence ( 0 ) 

a rdnsequence: 2 items (id-at-commonName=localhost,1d-at-organ1zat1onName=test org) 
B RDNSequence item: l item (id-at-organl2at1onName=test org) 
a RDNSequence item: l item (1d-at-commonName=localhost) 
a subjectPublicKeyinfo 

a algorithm (rsasncryption) 

Algorithm id: 1.2.840.113549.1.1.1 (rsaEncryption) 

Padding: 0 

subjectPublicKey: 3082010a0282010100def6ef0a37a7742e66286cb58c4317... 
a extensions: 6 items 

a Extension (id-ce-basicconstraints) 

Extension id: 2.5.29.19 (id-ce-basicconstraints) 
critical: True 
Basicconstraintssyntax 
a Extension (id-ce-subjectAltName) 

Extension id: 2.5.29.17 (id-ce-subjectAltName) 
a General Names: 1 item 

a General Name: dNSName (2) 
dNSName: local host 
a Extension (id-ce-extKeyUsage) 

Extension Id: 2.5.29.37 (id-ce-extKeyUsage) 
a KeyPurposelDs: 1 item 

KeyPurposeld: 1.3.6.1.5.5.7.3.1 (id-kp-serverAuth) 
a Extension (id-ce-keyusage) 

Extension id: 2.5.29.15 (id-ce-keyusage) 
critical: True 
Padding: 7 

a Keyusage: aOOO (digitalsignature, keyEncipherment) 
a Extension (id-ce-subjectKeyidentifier) 

Extension id: 2.5.29.14 (1d-ce-subjectKeyidentifier) 
subjectKeyidentifier: a5e38f916a4bfbbe3096908fae61d59ff35e84l9 
a Extension (id-ce-authorityKeyidentifier) 

Extension id: 2.5.29.35 (id-ce-authorityKeyidentifier) 
a AuthorityKeyidentifier 

keyidentifier: 420796cd2ebb0e5e89aaafb9bl7d946a3dl97146 
a algorithmidentifier (shawithRSAEncryption) 

Algorithm id: 1.2.840.113549.1.1.5 (shawithRSAEncryption) 

Padding: 0 

encrypted: 138012e5d76d666f00d85583251c71ff70c53c5653200b84... 


Figure 18-34 Following the ServerHello, the server generates a Certificate message to carry its cer¬ 
tificate. The client can use the certificate to authenticate the server. The same message 
format is used when the server authenticates the client. 

[Figure 18-35 image: Wireshark dissection of frame 13 (113 bytes, TCP port 5556 to port
49710 on 127.0.0.1). The segment carries two TLS 1.2 Handshake records. The first is a
CertificateRequest listing certificate types RSA Sign and DSS Sign; signature/hash
algorithms SHA-1 with RSA, SHA-1 with DSA, SHA-256 with RSA, SHA-384 with RSA, and
SHA-512 with RSA; and one distinguished name (id-at-commonName=Test CA). The second is a
ServerHelloDone of length 0.]


Figure 18-35 The server's CertificateRequest and ServerHelloDone messages are contained in the 
same TCP segment. The client can use the certificate to authenticate the server. The 
same message format is used when the server authenticates the client. 
message (packet 19). The CertificateVerify message contains a signature on a hash
of all the session's handshake messages sent or received so far, signed using the
private key of the client. This proves not only that the client is authentic, but that
it has participated appropriately in the TLS exchanges up to this point and not
lost or reordered any messages. After the CertificateVerify message, the
ChangeCipherSpec message begins the subsequent (encrypted) communication.

18.9.2 TLS with Datagrams (DTLS) 

The TLS protocol assumes a stream-based underlying transport protocol for delivering
its messages. A datagram version (DTLS) relaxes this assumption but aims
to otherwise achieve the same security goals as TLS using essentially all the same
message formats. It was originally motivated by protocols such as SIP that run on
UDP but do not care to use IPsec [RFC5406]. DTLS has also been adapted for use
with DCCP [RFC5238] and SCTP [RFC6083]. The current version at the time of
writing is DTLS 1.0 [RFC4347], based on TLS 1.1. An update, based on TLS 1.2, is in
the works [IDDTLS]. DTLS uses the same protocol layering shown in Figure 18-28 and
most of the same message exchanges.

The main challenge of providing TLS-like service without a reliable transport
is that datagrams may get lost, reordered, or duplicated. These problems
can affect encryption and the Handshake protocol, both of which have ordering
dependencies in TLS. To handle them, DTLS adds an explicit sequence number to
each record carried by the record layer (they were implicit with regular TLS) and
a timeout-based retransmission scheme for the Handshake protocol, whose messages
carry their own sequence numbers, distinct from those of the record layer.

18.9.2.1 DTLS Record Layer 

In TLS, the ordering of records is important because the MAC computation of 
one record depends on its predecessor. More specifically, the MAC computation 
depends on an implicit 64-bit sequence number for each record that is incorrect 
in the presence of datagram reordering or loss. To remedy this problem, DTLS 
uses explicit sequence numbers at the record layer. These sequence numbers are 
reset to the value 0 after each ChangeCipherSpec message is sent. They are used 
in combination with an additional 16-bit epoch number incorporated into each 
record's header. The epoch number is incremented by 1 for each change of cipher 
state. This handles the situation where multiple messages containing the same 
sequence number, generated as a result of multiple proximate handshakes, might 
be in flight simultaneously. 

The MAC computation in DTLS is modified from its TLS counterpart to
include the 64-bit concatenation of the two new fields (epoch first, followed by
sequence number). This allows each record to be handled independently. Note
that with TLS, a bad MAC results in connection termination. With DTLS, a full
connection abort is not necessary, and a receiver may choose to simply discard the
record containing the invalid MAC or send an alert message (which, if generated,
must be fatal).
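
The 64-bit value that enters the MAC can be sketched as follows. This is a minimal
illustration, not code from any DTLS implementation: the function name is hypothetical,
and the 48-bit width of the sequence number is inferred from the 16-bit epoch and the
64-bit total given above.

```python
import struct

def dtls_mac_sequence(epoch: int, seq: int) -> bytes:
    """Return the 8 bytes (16-bit epoch, then 48-bit sequence number) that
    take the place of TLS's implicit 64-bit sequence number in the MAC input."""
    if not 0 <= epoch < 1 << 16:
        raise ValueError("epoch must fit in 16 bits")
    if not 0 <= seq < 1 << 48:
        raise ValueError("sequence number must fit in 48 bits")
    # Epoch occupies the high-order 16 bits; the sequence number the low 48.
    return struct.pack("!Q", (epoch << 48) | seq)

# Record 5 sent after the first change of cipher state (epoch 1).
print(dtls_mac_sequence(1, 5).hex())
```

Because both fields travel in every record header, a receiver can rebuild this value
for any record it receives, regardless of arrival order.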

Duplicates are simply dropped or are optionally considered as a replay and
possible attack. Replay detection, if supported, is based on keeping a window of
current sequence numbers at the receiver. The window is required to be at least 32
messages but is suggested to be at least 64. The scheme is similar to that used in
IPsec for AH and ESP. Records arriving with sequence numbers less than the left
window edge are silently discarded as old or duplicate. Those within the window
are checked as possible duplicates. A message within the window carrying
a valid MAC is kept, even if out of order. Those with invalid MACs are discarded.
Those with valid MACs that exceed the right window edge cause the right window
edge to be advanced. Thus, the right window edge represents the validated
message with the highest sequence number.
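
The window logic just described can be sketched with a bitmap, in the style commonly
used for IPsec anti-replay. The class and method names here are hypothetical, and the
sketch assumes the caller verifies the MAC between the two calls.

```python
class ReplayWindow:
    """Sliding anti-replay window over record sequence numbers."""

    def __init__(self, size: int = 64):
        self.size = size
        self.right = -1    # highest validated sequence number (right edge)
        self.bitmap = 0    # bit i set => sequence number (right - i) was seen

    def check(self, seq: int) -> bool:
        """Return True if seq is neither left of the window nor a duplicate."""
        if seq > self.right:
            return True                      # beyond the right edge: new
        offset = self.right - seq
        if offset >= self.size:
            return False                     # left of the window: too old
        return not (self.bitmap >> offset) & 1

    def update(self, seq: int) -> None:
        """Record seq. Call only after the record's MAC has been validated."""
        if seq > self.right:
            shift = seq - self.right
            mask = (1 << self.size) - 1
            self.bitmap = ((self.bitmap << shift) | 1) & mask
            self.right = seq                 # advance the right edge
        else:
            self.bitmap |= 1 << (self.right - seq)

window = ReplayWindow()
for seq in (0, 5, 3, 5):
    if window.check(seq):
        window.update(seq)   # a real receiver validates the MAC first
        print(seq, "accepted")
    else:
        print(seq, "dropped")
```

Splitting the check from the update keeps a forged record with a large sequence
number from moving the right edge before its MAC has been verified.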

A single datagram may contain multiple DTLS records, but no single record
may span multiple datagrams. The record layer allows applications to implement
a PMTUD process similar to TCP's (see Chapter 15) and avoids sending datagrams
it believes are likely to be fragmented. Indeed, applications are supposed to receive
an error indication if they attempt to send application messages that exceed the
PMTU or maximum application datagram size (PMTU minus DTLS overhead).
An exception to this rule is how DTLS handles the Handshake protocol, which can
involve relatively large messages.

18.9.2.2 DTLS Handshake Protocol 

Handshake protocol messages can be as large as 2^24 - 1 bytes but in practice are
several kilobytes. This can exceed a typical maximum UDP datagram size of 1.5KB.
To handle this situation, a Handshake protocol message may span multiple DTLS
records using a fragmentation procedure. Each fragment is contained in a record,
which is contained in an underlying datagram. To implement fragmentation, each
Handshake message contains a 16-bit Sequence Number field, a 24-bit Fragment Offset
field, and a 24-bit Fragment Length field.

To perform fragmentation, the original message's content is divided into multiple
contiguous data ranges. Each range is required to be less than the maximum
fragment size. Each range is placed in a message fragment. Each fragment
contains the same sequence number as the original message. The Fragment Offset
and Fragment Length fields are expressed in bytes. Senders avoid overlapping
data ranges, but receivers are required to handle this possibility because senders
may be required to adjust their record size over time and retransmissions may be
necessary.
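
The fragmentation and reassembly steps above can be sketched as follows. The function
names and the max_fragment parameter are hypothetical, and the sketch assumes the
receiver already knows the total message length (carried in the Handshake header's
Length field). Overlapping ranges are tolerated by tracking coverage per byte.

```python
def fragment(message: bytes, max_fragment: int):
    """Yield (fragment_offset, fragment_length, data) tuples covering message."""
    if max_fragment <= 0:
        raise ValueError("max_fragment must be positive")
    for offset in range(0, len(message), max_fragment):
        chunk = message[offset:offset + max_fragment]
        yield offset, len(chunk), chunk

def reassemble(total_length: int, fragments):
    """Rebuild a message from fragments in any order; overlaps are allowed.
    Returns None while some byte range is still missing."""
    buf = bytearray(total_length)
    have = bytearray(total_length)       # 1 where a byte has been received
    for offset, length, data in fragments:
        if offset + length > total_length:
            raise ValueError("fragment exceeds message length")
        buf[offset:offset + length] = data[:length]
        have[offset:offset + length] = b"\x01" * length
    return bytes(buf) if all(have) else None

message = bytes(range(256)) * 10                   # a 2560-byte message
pieces = list(fragment(message, 1000))
pieces.append((500, 1000, message[500:1500]))      # an overlapping retransmission
print(reassemble(len(message), reversed(pieces)) == message)
```

A receiver that gets None simply keeps the fragments it has and waits for the peer's
retransmission to fill the gap.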

To handle message loss, DTLS implements a simple timeout and retransmission
capability that operates on groups of messages called flights. Figure 18-36
shows both the full (left) and abbreviated (right) establishment exchanges, along
with the DTLS Handshake protocol state machine.

In Figure 18-36, flight numbers are given in the area between the full and
abbreviated exchanges. The full exchange is very similar to the full TLS exchange
shown in Figure 18-30, except for the additional HelloVerifyRequest and second
ClientHello messages (which now contain cookies). The abbreviated exchange is
different, however. In DTLS the server sends the first Finished message, whereas
in TLS the client sends the first Finished message.

[Figure 18-36 image: message diagrams for the Full Exchange (left) and the Abbreviated
Exchange (Reconnect) (top right) between client and server, with the DTLS Handshake
protocol state machine at the bottom right.]

Figure 18-36 In DTLS, the possibility of lost datagrams must be handled. The initial full exchange
(left) comprises six "flights" of information, each of which can be retransmitted. The
DTLS abbreviated exchange (top right) uses only three and differs slightly from TLS.
DTLS maintains a three-state finite state machine (bottom right) when processing the
protocol.

The lower right portion of Figure 18-36 depicts the state machine used by
DTLS implementations when performing the Handshake protocol. There are
three primary states: Preparing, Sending, and Waiting. The client starts in the
Preparing state as it creates its ClientHello message. The server begins in the Waiting
state with no buffered messages or active retransmission timer. When sending, a
retransmission timer is set and the Waiting state is entered upon completion of
the transmission. Expiration of the retransmission (RTX) timer brings the protocol
back to the Sending state to perform a retransmission, as does the receipt of
a retransmitted flight from the peer. In this latter case the local system performs
a retransmission of its own flight with the rationale that its previous transmission
must have been partially or completely lost, as indicated by the presence of
peer retransmission. If everything goes well, a flight is received, and the local
system either finishes or returns to the Preparing state to form its next flight for
transmission.

The state machine is driven by a retransmission timer with a recommended
default value of 1s. If no response for a flight has been received within the timeout
duration, the flight is retransmitted using the same Handshake protocol sequence
numbers; record-layer sequence numbers still advance. Subsequent retransmissions
without a response result in doubling of the RTX timeout value, up to a value
of at least 60s. This value may be reset after a successful transmission or a long idle
period (ten times the current timer value or more).
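
The timer's growth can be sketched as a generator. The names are hypothetical, and
capping at exactly 60s is an assumption; the text requires only that the ceiling be at
least that large.

```python
from itertools import islice

def rtx_timeouts(initial: float = 1.0, maximum: float = 60.0):
    """Yield successive retransmission timeouts, doubling up to maximum."""
    timeout = initial
    while True:
        yield timeout
        timeout = min(timeout * 2, maximum)   # double after each unanswered flight

# Timeouts applied to the first eight transmissions of one flight.
print(list(islice(rtx_timeouts(), 8)))
```

An implementation would start a fresh generator whenever the timer is reset, for
example after a flight is acknowledged by the arrival of the peer's next flight.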

18.9.2.3 DTLS DoS Protection 

When datagrams are used instead of a reliable byte stream protocol, some additional
security considerations come into play. Of special concern are two potential
DoS attacks. It is relatively simple for an attacker to forge a source IP address when
sending a ClientHello message. Many such messages could cause a DoS attack
at the DTLS server because of exhaustion of processing resources when forming
responses. A variant of this attack involves having multiple attacking machines
include the same forged source (victim) IP address. The responding server(s)
then send(s) responses to the victim's IP address, causing the victim machine to
undergo a DoS attack.

A stateless cookie validation procedure incorporated into the Hello exchange
helps resist both DoS attacks. When a server receives a ClientHello message, it
generates a new HelloVerifyRequest message containing a cookie of up to 32 bytes
(which may be a function of a secret, the client's IP address, and the connection
parameters). A subsequent ClientHello message must contain a copy of the appropriate
cookie. Otherwise, the server refuses the exchange. This allows the server to
quickly dispense with requests that do not provide valid cookies. It does not
protect against coordinated attacks from multiple legitimate IP addresses that can
complete the cookie exchange.
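
One way to build such a cookie without keeping per-client state is a keyed hash over
the client's address and Hello parameters. The sketch below is an assumption about how
a server might do this, not the method of any particular implementation; the function
names, the inclusion of the port number, and the choice of HMAC-SHA-256 (whose 32-byte
output fits the DTLS 1.0 cookie limit) are illustrative.

```python
import hashlib
import hmac
import os

SERVER_SECRET = os.urandom(32)   # hypothetical per-server secret, rotated periodically

def make_cookie(client_ip: str, client_port: int, hello_params: bytes,
                secret: bytes = SERVER_SECRET) -> bytes:
    """Derive a cookie from the secret, the client's address, and its parameters."""
    msg = client_ip.encode() + b"|" + str(client_port).encode() + b"|" + hello_params
    return hmac.new(secret, msg, hashlib.sha256).digest()

def cookie_is_valid(cookie: bytes, client_ip: str, client_port: int,
                    hello_params: bytes, secret: bytes = SERVER_SECRET) -> bool:
    """Recompute the cookie for this ClientHello and compare in constant time."""
    expected = make_cookie(client_ip, client_port, hello_params, secret)
    return hmac.compare_digest(cookie, expected)

cookie = make_cookie("192.0.2.1", 4000, b"hello-parameters")
print(cookie_is_valid(cookie, "192.0.2.1", 4000, b"hello-parameters"))
print(cookie_is_valid(cookie, "198.51.100.7", 4000, b"hello-parameters"))
```

Because the server can recompute the cookie from the second ClientHello alone, it
stores nothing for clients that never answer the HelloVerifyRequest.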


18.10 DNS Security (DNSSEC) 

Now that we have discussed popular security protocols at the link, network, and
transport layers, we move to the application layer. Although it is not yet widely
deployed at the time of writing, we shall focus on how to provide enhanced
security for the Domain Name System (DNS). Security for DNS covers both data
within the DNS (resource records or RRs) as well as security of transactions that
synchronize or update contents of DNS servers. Given its important role in the
operation of the Internet, a major effort has been undertaken to deploy these
security mechanisms. The mechanisms are called the Domain Name System Security
Extensions (DNSSEC) and are discussed in a family of RFCs [RFC4033][RFC4034]
[RFC4035]. These RFCs are sometimes referred to as DNSSECbis because they
replace an earlier set of specifications for DNSSEC. As we explore DNSSEC in
further detail, it may be worthwhile to review the description of basic DNS (see
Chapter 11).

The extensions provide origin authentication and integrity assurance for DNS
data, along with a (limited) key distribution facility. That is, the extensions provide
a cryptographically secure way to determine what entity has authored a block of
DNS information and that the information has been received unaltered. DNSSEC
also provides authenticated nonexistence. DNS responses indicating the nonexistence
of a particular domain name include protection similar to that of responses
for existing domain names. DNSSEC does not provide privacy (confidentiality)
of DNS information, DoS protection, or access control. Transaction security, used
with DNSSEC, is defined separately, and we will mention it briefly after discussing
the core DNSSEC data security capabilities.

DNSSEC accommodates resolvers with varying levels of security "awareness."
A validating security-aware resolver (also called validating resolver) checks
cryptographic signatures to ensure that the DNS data it handles is secure. Other
resolvers, including stub resolvers on hosts and the "resolver side" of recursive
name servers, may be security-aware but may not perform cryptographic validation.
Instead, such resolvers should establish secure associations with validating
resolvers. We shall focus on the validating resolvers, as they are the most
sophisticated and interesting. When operating, they are able to ascertain whether DNS
information is secure (valid with all signatures checked), insecure (valid signatures
indicate that something should not be present but is), bogus (proper data appears
to be present but cannot be validated for some reason), or indeterminate (veracity
cannot be determined, usually because of lack of signatures). The indeterminate
case is the default case when no other information is available.

DNSSEC works securely only when a zone is signed by a domain administrator,
there is some basis for trust, and both server and resolver software participate.
Validating resolvers check signatures to ensure that DNS information is secure,
and they must be configured with one or more initial trust anchors that are similar
to root certificates in a PKI. Note, however, that DNSSEC is not a PKI; in particular,
it provides only limited signing and key revocation. It does not implement
an analog to certificate revocation lists [RFC5011].

When performing a DNS query with DNSSEC, a security-aware resolver
uses EDNS0 and enables the DO (DNSSEC OK) bit in an OPT meta-RR present
in the request. This bit indicates the client's interest in and ability to process
DNSSEC-related information along with its support for EDNS0. The DO bit is
the first (high-order) bit of the second 16-bit field in the "extended RCODE and
flags" portion of the EDNS0 meta-RR (see Section 3 of [RFC3225] and Section 4 of
[RFC2671]). Servers that receive requests in which the DO bit is not set (or present)
are prohibited from returning most of the RRs discussed in Section 18.10.1
unless such records are explicitly asked for in the request. This helps to improve
DNS performance because it avoids having to carry security-related RRs that are
never processed by security-unaware resolvers. This can be especially beneficial
because DNS typically uses relatively small UDP packets and falls back to using
TCP, which increases latency due to its three-way handshake, for large responses.
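
The position of the DO bit can be made concrete by building the OPT meta-RR directly.
This is a minimal sketch with hypothetical function names; the advertised UDP payload
size of 4096 bytes is an assumed value, and no EDNS0 options are included.

```python
import struct

DO_BIT = 0x8000   # high-order bit of the 16-bit flags field

def build_opt_rr(udp_payload_size: int = 4096, do: bool = True) -> bytes:
    """Build an EDNS0 OPT meta-RR for the additional section of a query."""
    name = b"\x00"                    # owner name is the root
    rr_type = 41                      # OPT
    ext_rcode, version = 0, 0
    flags = DO_BIT if do else 0
    # The 32-bit TTL slot holds: extended RCODE (8) | version (8) | flags (16).
    ttl = (ext_rcode << 24) | (version << 16) | flags
    # The CLASS slot carries the sender's UDP payload size; RDLENGTH is 0.
    return name + struct.pack("!HHIH", rr_type, udp_payload_size, ttl, 0)

def do_bit_set(opt_rr: bytes) -> bool:
    """Check the DO bit in an OPT RR produced by build_opt_rr."""
    (ttl,) = struct.unpack("!I", opt_rr[5:9])
    return bool(ttl & DO_BIT)

print(build_opt_rr().hex())
```

A server that sees no OPT RR at all treats the query the same as one whose DO bit is
clear.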

When a server processes a request from a DNSSEC-enabled resolver, it checks
the CD (checking disabled) bit in the DNS request (see Chapter 11). If set, this
indicates that the client is willing to accept nonvalidated data in a response. When
preparing a response, a server ordinarily validates the data it is returning
cryptographically. Successful validation results in the AD (authentic data) bit being set
in the response [RFC4035]. A security-aware but nonvalidating resolver can in
principle trust this information if it has a secure path to the server. However, the
arguably best case is to use validating stub resolvers that perform cryptographic
validation and consequently set the CD bit on queries. This provides end-to-end
security of the DNS (i.e., an intermediate resolver need not be trusted), and it
reduces the computational load on the intermediate servers that would otherwise
have to perform cryptographic validation.

18.10.1 DNSSEC Resource Records 

As specified in [RFC4034], DNSSEC uses four new resource records (RRs) and two
message header bits (CD and AD). It also requires EDNS0 support and uses the
DO bit field we mentioned previously. Two of the four RRs are used to contain
signatures for portions of the DNS name space, and the other two are used in helping
to distribute and validate keys. A change in [RFC5155] created two additional new
RRs, intended to replace one of the original four.

18.10.1.1 DNS Security (DNSKEY) Resource Records 

We begin by looking at how DNSSEC stores and distributes keys. DNSSEC uses
the DNSKEY resource record to hold public keys. The keys are intended for use
with DNSSEC only; other RRs (e.g., the CERT RR [RFC4398]) may be useful for
holding keys or certificates for other purposes. The format of the RDATA portion
of a DNSKEY RR is shown in Figure 18-37.

The Flags field in Figure 18-37 has 3 bits currently defined. Bit 7 is the Zone Key
bit field. If set, the DNSKEY RR owner's name must be the name of a zone and the
included key is called either a Zone Signing Key (ZSK) or a Key Signing Key (KSK).
If not set, the record holds some other kind of DNS key that cannot be used for
validating signatures for zones. Bit 15 is called the Secure Entry Point (SEP) bit. It
is a hint that can be used by debugging or signing software to make an informed
guess as to the purpose of the key. Signature validation does not interpret the SEP
bit, but keys with this bit set are usually KSKs and are used to secure the DNS
hierarchy by validating keys in child zones (via DS records; see Section 18.10.1.2).
Bit 8 is the Revoked bit [RFC5011]; if set, the key cannot be used for validation.
The Protocol field holds the value 3 for this version of DNSSEC. The Algorithm
field indicates the signing algorithm [DNSSECALG]. Only DSA and RSA with
SHA-1 (values 3 and 5, respectively) are defined for use with DNSKEY RRs according
to [RFC4034], but additional specifications support other algorithms (e.g., see
[RFC5933] for ECC-GOST (value 12), [RFC5702] for RSA with SHA-256 (value 8)). These
values are also used with several of the other DNSSEC RRs. The Public Key field holds
a public key whose format depends on the Algorithm field.

Figure 18-37 The RDATA portion of the DNSKEY RR contains a public key used only for DNSSEC.
The Flags field includes a Zone Key indicator (bit 7), a Secure Entry Point indicator (bit
15), and Revoked indicator (bit 8). Generally, the zone key is set for all DNSSEC keys.
If the advisory SEP bit is also set, the key is typically called a key signing key and is
used for validating delegations to child zones. If not, the key is usually a zone signing
key, has a shorter validity period, and is typically used to sign zone contents and
not delegations. The included key is to be used with the algorithm specified in the
Algorithm field.
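
Because the Flags bits are numbered from the most significant bit, it can help to see
them as masks. The sketch below uses hypothetical names and assumes the RDATA layout
described above (16-bit Flags, 8-bit Protocol, 8-bit Algorithm, then the key). The
KSK/ZSK labels are conventions only, since validators do not interpret the SEP bit.

```python
import struct

ZONE_KEY = 0x0100   # bit 7, counting bit 0 as the most significant bit
REVOKED = 0x0080    # bit 8
SEP = 0x0001        # bit 15

def parse_dnskey_rdata(rdata: bytes):
    """Split DNSKEY RDATA into (flags, protocol, algorithm, public_key)."""
    flags, protocol, algorithm = struct.unpack("!HBB", rdata[:4])
    return flags, protocol, algorithm, rdata[4:]

def describe_dnskey_flags(flags: int) -> str:
    """Give the conventional reading of a DNSKEY Flags value."""
    if not flags & ZONE_KEY:
        return "not a zone key"
    if flags & REVOKED:
        return "revoked zone key"
    return "KSK (by convention)" if flags & SEP else "ZSK (by convention)"

for value in (256, 257, 385):
    print(value, describe_dnskey_flags(value))
```

The values 256 and 257 are the ones most often seen in zone files, corresponding to
the Zone Key bit alone and the Zone Key bit plus SEP.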

18.10.1.2 Delegation Signer (DS) Resource Records 

A delegation signer (DS) resource record is used to refer to a DNSKEY RR, usually
from a parent zone to a descendant zone. These records are used during the
authentication process to verify a public key (see Section 18.10.2). The DS RR format
is shown in Figure 18-38.

Figure 18-38 The RDATA portion of the DS RR contains a nonunique reference to a DNSKEY RR in
the Key Tag field. It also contains a message digest of the DNSKEY RR and its owner,
plus indications of the type of digest and algorithm.

The Key Tag field in Figure 18-38 is a reference to a DNSKEY RR. However, it
is not unique. Multiple DNSKEY RRs may have the same tag value, so the field is
used only as a search hint (confirming that validation is still necessary). The value
for this field is computed as the 16-bit unsigned sum of data comprising the referenced
DNSKEY RR RDATA area (carries are ignored) as shown in Figure 18-37. The
Algorithm field uses the same values as the DNSKEY RR Algorithm field. The Digest
Type field indicates the type of digest used. Only value 1 (SHA-1) is defined by
[RFC4034], but SHA-256 (value 2) is specified for use by [RFC4509]. The current
list is contained in the DS RR Type Digest Algorithms registry [DSRRTYPES]. The
Digest field contains the digest of the DNSKEY RR being referenced. More specifically,
the digest is computed as follows:

digest = digest_algorithm(DNSKEY owner name | DNSKEY RDATA)

where | is the concatenation operator and the DNSKEY RDATA value is computed
from the referenced DNSKEY RR as follows:

DNSKEY RDATA = Flags | Protocol | Algorithm | Public Key

For the case of SHA-1, the digest is 20 bytes in length. For SHA-256 it is 32
bytes. The DS RR is used to provide a downward link in the authentication chain
across zone boundaries, so the referenced DNSKEY RR must be a zone key (i.e., bit
7 of the Flags field in the DNSKEY RR must be set).
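
Both computations are short enough to sketch. The key tag routine below follows the
reference code in Appendix B of [RFC4034], which sums the RDATA as 16-bit words, folds
the overflow in once, and keeps the low 16 bits. The function names are hypothetical,
and name_to_wire is a simplified encoder that assumes plain ASCII labels.

```python
import hashlib

def key_tag(dnskey_rdata: bytes) -> int:
    """Compute the 16-bit Key Tag over DNSKEY RDATA."""
    acc = 0
    for i, byte in enumerate(dnskey_rdata):
        acc += byte if i & 1 else byte << 8   # even offsets are high-order bytes
    acc += (acc >> 16) & 0xFFFF               # fold the overflow in once
    return acc & 0xFFFF

def name_to_wire(name: str) -> bytes:
    """Encode a domain name as lowercase, uncompressed, length-prefixed labels."""
    labels = [label for label in name.rstrip(".").lower().split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"

def ds_digest(owner: str, dnskey_rdata: bytes, digest_type: int = 2) -> bytes:
    """digest = digest_algorithm(DNSKEY owner name | DNSKEY RDATA)."""
    algorithm = {1: hashlib.sha1, 2: hashlib.sha256}[digest_type]
    return algorithm(name_to_wire(owner) + dnskey_rdata).digest()

rdata = bytes.fromhex("01010308") + b"placeholder-key-bytes"   # not a real key
print(key_tag(rdata), ds_digest("example.com", rdata).hex())
```

Since the key tag is only a hint, a validator that finds several DNSKEY RRs with the
same tag must try each one against the digest or signature.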


Note 

At the time of writing, a variant of the DS RR called DS2 is under consideration
[IDDS2]. It introduces a Canonical Signer Name to the DS RR so that multiple
zones with identical content can be named differently and signed by multiple
(different) signers. In addition, there is a DLV RR [RFC4431] that has been used to
provide delegations in cases where a parent zone is not signed or has not published
DS RRs. The format of a DLV RR is identical to that of a DS RR; only the
interpretation differs.


18.10.1.3 NextSECure (NSEC and NSEC3) Resource Records 
Now that we have seen the RRs needed to hold and securely refer to keys, we move
on to the records used to validate the structure of a zone and the resource records
it contains. The NextSECure (NSEC) RR is used to hold the "next" RRset owner's
domain name in the canonical ordering of names (see Section 18.10.2.1) or a
delegation point NS type RRset. (Recall, an RRset is a set of RRs with the same owner,
class, TTL, and type but with different data.) It also holds a list of RR types present
at the NSEC RR owner's name. This provides authentication and integrity verification
for the zone structure. The format of an NSEC RR is shown in Figure 18-39.
Figure 18-39 The RDATA portion of the NSEC RR contains the name of the next RRset owner for the
zone in canonical order. It also contains an indication of which RR types were present
at the NSEC RR owner's domain name.


The NSEC RR is used to form a chain of names corresponding to RRsets
within a zone. Consequently, an RRset not present in the chain can be shown to
not exist. This provides the authenticated denial of existence feature mentioned
previously. The Next Domain Name field holds the next entry in the canonically
ordered domain name chain for the zone without using the domain name compression
technique described in Chapter 11. The value of this field for the last
NSEC record of the chain is the zone apex (the owner name of the zone's SOA RR).

The Type Bit Maps field of the NSEC RR holds a bitmap of RR types present at
the NSEC RR owner's domain name. There is a maximum of 64K possible types,
about 100 of which have been defined to date [DNSPARAMS]. Only a fraction of
these are in widespread use. For example, the NSEC RR for the Internet's root zone
(domain name "."), which became operational with DNSSEC on July 15, 2010, contains
a Next Domain Name field of ac (a ccTLD) and a bitmap indicating the presence of
records of the following types: NS, SOA, RRSIG, NSEC, and DNSKEY.

To encode the presence of a type, the whole space of RR types is divided into
256 "window blocks," numbered 0 through 255. For each block number, the presence
of up to 256 RR types can be encoded using a bit mask. Given a block number
N and bit position P, the corresponding RR type number is (N*256 + P). For
example, in block 1, bit position 2 corresponds to RR type 258 (a type not currently
defined). The field is encoded as follows:

Type Bit Maps = (window block number | bitmap length | bitmap)*

where | is the concatenation operator and * represents Kleene closure (i.e., zero or
more). Each instance of the window block number contains a value in the range
0-255, and the bitmap length contains the length of the corresponding bitmap
in bytes (maximum value 32). The window block number and bitmap length are
each single bytes, and the bitmap can be as long as 32 bytes (256 bits, one for each
possible RR type in the window). Blocks in which no RR type is present are not
included. The encoding is optimized for a sparse presence of types across blocks.
For example, if only RR types 1 (A) and 15 (MX) were present, the encoding for the
field would be as follows: 0x00024001 = (0x00 | 0x02 | 0x4001).
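
The window-block encoding can be sketched in a few lines. The function names are
hypothetical. Trailing all-zero bytes are trimmed from each bitmap, which is how the
two-byte bitmap in the example above arises.

```python
def encode_type_bitmaps(rr_types) -> bytes:
    """Encode RR type numbers into the Type Bit Maps field."""
    windows = {}
    for rr_type in sorted(set(rr_types)):
        block, position = divmod(rr_type, 256)        # type = N*256 + P
        bitmap = windows.setdefault(block, bytearray(32))
        bitmap[position // 8] |= 0x80 >> (position % 8)   # bit 0 is the MSB
    out = bytearray()
    for block in sorted(windows):                     # empty blocks never appear
        bitmap = bytes(windows[block]).rstrip(b"\x00")
        out += bytes([block, len(bitmap)]) + bitmap
    return bytes(out)

def decode_type_bitmaps(data: bytes):
    """Recover the sorted list of RR type numbers from a Type Bit Maps field."""
    types, i = [], 0
    while i < len(data):
        block, length = data[i], data[i + 1]
        for j, byte in enumerate(data[i + 2:i + 2 + length]):
            for bit in range(8):
                if byte & (0x80 >> bit):
                    types.append(block * 256 + j * 8 + bit)
        i += 2 + length
    return types

print(encode_type_bitmaps([1, 15]).hex())             # types A and MX
print(encode_type_bitmaps([2, 6, 46, 47, 48]).hex())  # NS, SOA, RRSIG, NSEC, DNSKEY
```

The second call encodes the set of types listed earlier for the root zone's NSEC RR.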





The original structure of NSEC records defined in [RFC4034] creates a situation
in which anyone is able to enumerate the authority records in a zone by walking
the NSEC chain, called zone enumeration. This is an unwanted opportunity
for "leakage" of information for many deployments. As a result, a pair of RRs,
intended to replace NSEC, is defined in [RFC5155]. The first is called NSEC3. It
uses cryptographic hashes of RR owner domain names rather than unencoded
domain names. The format is shown in Figure 18-40.
Figure 18-40 The RDATA portion of the NSEC3 RR contains a hash of the name of the next RRset
owner for the zone in canonical order. The hash function has been applied the number
of times specified in the Iterations field. The variable-length Salt value is appended to
the name prior to applying the hash function to provide dictionary attack resistance.
The Type Bit Maps field uses the same structure as NSEC RRs. NSEC3PARAM records
are similar but contain only the hash parameters (not the Next Hashed Owner or Type
Bit Maps fields).


In the NSEC3 record, the Hash Algorithm field identifies the hash function
applied to the next owner name to produce the Next Hashed Owner field. Only
SHA-1 (value 1) is defined to date [NSEC3PARAMS]. The low-order bit of the Flags
field contains an opt-out flag. If set, it indicates that the NSEC3 record may cover
unsigned delegations. This is used in cases where a delegation (NS RRset) refers
to a child zone that is not required to be or is not desired to be signed. The
Iterations field indicates how many times the hash function has been applied. A larger
number of iterations may help to protect against finding the owner names
corresponding to hash values found in NSEC3 records (dictionary attacks). The Salt
Length field gives the length of the Salt field in bytes. The Salt field contains a value
appended to the original owner name prior to computing the hash function. Its
purpose is to help thwart dictionary attacks.

The second RR specified by [RFC5155] is called the NSEC3PARAM RR (not
shown separately). It uses the same format as the NSEC3 RR, except the Hash
Length, Next Hashed Owner, and Type Bit Maps fields are not present. It is used by
an authoritative name server when choosing NSEC3 records to use in a negative
response. The NSEC3PARAM RR provides the parameters needed for computing
a hashed owner name.

To obtain the hash value for the Next Hashed Owner field, the following
computation is performed:


IH(0) = H(owner name | Salt)

IH(k) = H(IH(k - 1) | Salt) if k > 0

Next Hashed Owner = IH(Iterations)

where H is the hash function specified in the Hash Algorithm field, | is the
concatenation operator, and the owner name is in canonical form. The iterations and
salt values are taken from the corresponding fields of the NSEC3 RR. The Iterations
value thus counts the applications of H beyond the first, as specified in [RFC5155].
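
The iterated hash can be sketched directly from the recurrence. This sketch assumes,
as in [RFC5155], that Iterations counts the applications of H after the first, so the
result is IH(Iterations). The function names are hypothetical, SHA-1 is the only hash
function assumed, and name_to_wire is a simplified encoder for plain ASCII names.

```python
import hashlib

def name_to_wire(name: str) -> bytes:
    """Encode a domain name as lowercase, uncompressed, length-prefixed labels."""
    labels = [label for label in name.rstrip(".").lower().split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"

def nsec3_hash(owner: str, salt: bytes, iterations: int) -> bytes:
    """Return IH(iterations) for the canonical form of owner."""
    digest = hashlib.sha1(name_to_wire(owner) + salt).digest()   # IH(0)
    for _ in range(iterations):                                   # IH(1) onward
        digest = hashlib.sha1(digest + salt).digest()
    return digest

# Salt and iteration count are illustrative values, as found in an NSEC3PARAM RR.
print(nsec3_hash("example", bytes.fromhex("aabbccdd"), 12).hex())
```

In zone files the 20-byte result is normally shown in Base32 with the extended hex
alphabet; that presentation step is omitted here.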

To avoid confusion between NSEC and NSEC3 RR types, [RFC5155] allocates
and requires the use of special security algorithm numbers 6 and 7 as aliases
for identifiers 3 (DSA) and 5 (RSA with SHA-1) in zones employing NSEC3 RRs.
Resolvers unaware of the NSEC3 record type receiving these values treat the resulting
records as insecure. This provides a certain limited form of backward compatibility
(i.e., failing, but doing so without incorrectly interpreting RR data).

18.10.1.4 Resource Record Signature (RRSIG) Resource Records 
Moving from the DNS structure to its contents, we require a way to provide origin
authentication and integrity protection for RRs. DNSSEC signs and validates
signatures on RRsets using the Resource Record Signature (RRSIG) RR, and every
authoritative RR in a zone must be signed (glue records and delegation NS records
present in parent zones aren't). An RRSIG RR contains a digital signature for a
particular RRset, along with information to identify which public key can be used
to validate the signature, as shown in Figure 18-41.

The Type Covered field indicates the type of the RRset to which the signature
applies. The value is taken from the standard set of RR types in [DNSPARAMS].
The Algorithm field indicates the signing algorithm. Only DSA and RSA with
SHA-1 (values 3 and 5, respectively) are defined for use with RRSIG RRs according
to [RFC4034], but [RFC5702] covers SHA-2 algorithms and [RFC5933] covers GOST
algorithms (from the Russian Federation). The Labels field gives the number of
labels in the original owner's name of the RRSIG RR. The Original TTL field holds
a copy of the TTL from the RRset as it appears in the authoritative zone (caching
name servers may reduce the TTL). The Signature Expiration and Signature Inception
fields indicate the ending and starting validity times for the signature, respectively,
expressed in seconds since January 1, 1970, 00:00:00 UTC. The Key Tag field helps to
identify the DNSKEY RR that can be used to obtain the public key necessary to validate
the signature contained in the Signature field, using the format described previously
for the DS RR.

Figure 18-41 The RDATA portion of the RRSIG RR contains a signature for an RRset. The TTL of
the RRset as it appears in the authoritative zone is also included, along with indicators
of the algorithm and signature validity period. The Key Tag field refers to a DNSKEY
RR containing a public key that can be used to validate the signature. The Labels field
indicates how many labels constitute the original owning name of the RR.
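
The fixed-size fields described above occupy the first 18 bytes of the RRSIG RDATA and
can be unpacked as shown in this sketch. The function and key names are hypothetical.
The validity check uses a plain integer comparison, which is a simplification: it
ignores wraparound of the 32-bit time values.

```python
import struct
import time

def parse_rrsig_fixed(rdata: bytes) -> dict:
    """Unpack the fixed-size fields at the start of RRSIG RDATA."""
    (type_covered, algorithm, labels, original_ttl,
     expiration, inception, key_tag) = struct.unpack("!HBBIIIH", rdata[:18])
    return {
        "type_covered": type_covered,
        "algorithm": algorithm,
        "labels": labels,
        "original_ttl": original_ttl,
        "expiration": expiration,   # end of validity, seconds since 1970
        "inception": inception,     # start of validity, seconds since 1970
        "key_tag": key_tag,
        "rest": rdata[18:],         # signer's name followed by the signature
    }

def signature_in_validity_period(fields: dict, now=None) -> bool:
    """Return True if now lies between inception and expiration."""
    now = int(time.time()) if now is None else now
    return fields["inception"] <= now <= fields["expiration"]

sample = struct.pack("!HBBIIIH", 1, 8, 2, 3600, 2000, 1000, 12345) + b"\x00sig"
fields = parse_rrsig_fixed(sample)
print(fields["key_tag"], signature_in_validity_period(fields, now=1500))
```

A validator would perform this check before spending effort on the cryptographic
verification itself.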


18.10.2 DNSSEC Operation 

Now that we have covered all the RRs required by DNSSEC, we can see how to 
use DNSSEC to secure zones. We shall first require the definition of a canonical 
ordering, mentioned earlier when defining the NSEC and NSEC3 record types. The 
purpose of a defined canonical ordering for a zone is to be able to enumerate a 
zone's contents in a reproducible way that can be signed (different orders of the 
same contents would produce different values for any good hash function). Once 
we are familiar with the ordering, we look at how a zone is signed and how signed 
records describing a zone are validated. 

18.10.2.1 Canonical Orderings and Forms 

There are three canonical orderings of interest to us: the canonical name order 
within a zone, the canonical form for a single RR, and the canonical ordering of 
an RRset [REC4034]. Recall from Chapter 11 that each RR has an owner name 
(owner's domain name) consisting of labels. By treating each label in a name as a 
left-justified string of bytes and treating uppercase US-ASCII letters as lowercase, 
we can form a list of names. We first sort the names by their most significant 
(right-most) label, then by the next most significant label, and so on. The absence 
of a byte sorts before a zero-value byte. A valid canonical ordering would be com, 
company.com, *.company.com, UK.company.COM, usa.company.com. Wildcards
can be used.
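
The name ordering just described can be sketched in a few lines of Python. This is a minimal illustration rather than a complete implementation: the function names are our own, and the sketch assumes plain US-ASCII names with no escaped dots inside labels.

```python
def canonical_key(name: str) -> list[bytes]:
    """Sort key approximating the canonical name order described above."""
    stripped = name.rstrip(".")
    # Split into labels and fold uppercase US-ASCII letters to lowercase.
    labels = stripped.lower().split(".") if stripped else []
    # Compare the most significant (right-most) label first.
    return [label.encode("ascii") for label in reversed(labels)]


def canonical_sort(names: list[str]) -> list[str]:
    """Return the names in canonical order, leaving their spelling untouched."""
    return sorted(names, key=canonical_key)


names = ["usa.company.com", "UK.company.COM", "*.company.com", "com", "company.com"]
print(canonical_sort(names))
```

Python compares byte strings (and lists of them) element by element and treats a shorter prefix as smaller, which mirrors the rule that the absence of a byte sorts before a zero-value byte. The wildcard label sorts ahead of the letters simply because the byte value of `*` is smaller.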

For a particular RR, there is a well-defined canonical form. This form requires
the RR to adhere to the following rules:

1. Every domain name is an FQDN and fully expanded (no compression
labels).

2. All uppercase US-ASCII letters in the owner name are replaced by the
corresponding lowercase versions.

3. All uppercase US-ASCII letters are replaced by their lowercase versions
for any domain names present in the RDATA portion of records with type
numbers 2-9, 12, 14, 15, 17, 18, 21, 24, 26, 33, 35, 36, 39, and 38.

4. Any wildcards (*) are not substituted.

5. The TTL is set to its original value as it appeared in the originating
authoritative zone or the Original TTL field of the covering RRSIG RR.


Note 

A number of clarifications and important changes are being applied to the baseline
DNSSECbis family of documents. The reader is encouraged to consult the
most recent version of [IDDCIN] for further details.


The canonical order of the RRs within an RRset follows essentially the same
rule as for owner names but applies to an RR's RDATA contents in canonical form
treated as a left-justified byte string.

18.10.2.2 Signed Zones and Zone Cuts 

DNSSEC depends on signed zones. Such zones include RRSIG, DNSKEY, and
NSEC (or NSEC3) RRs and may contain DS RRs if there is a signed delegation
point. Signing makes use of public key cryptography where the public keys are
stored in and distributed by the DNS. Figure 18-42 shows an abstract delegation
point between a parent and child zone.

In the figure, the parent zone contains its own DNSKEY RR, which provides
the public key corresponding to the private key used to sign all authoritative
RRsets in the zone using RRSIG RRs (multiple DNSKEYs are possible). A DS RR
in the parent provides a hash of one of the DNSKEY RRs in the child's apex. This
establishes a chain of trust from the parent to the child. A validating resolver that
trusts the parent's DS RR can validate the child's DNSKEY RR and ultimately the
RRSIGs and signed RRsets within the child zone. This happens only if the validator
has a root of trust that can be connected to the parent's DNSKEY RR.
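
To make the parent-to-child link concrete, the following sketch computes a key tag and a SHA-256 DS digest from a DNSKEY RR. It assumes the constructions given in [RFC4034] and [RFC4509] (the digest covers the owner name in canonical wire format followed by the DNSKEY RDATA; the key tag is a simple checksum over the RDATA). The helper names and the placeholder key bytes are our own and do not correspond to any real zone.

```python
import hashlib
import struct


def name_to_wire(name: str) -> bytes:
    """Encode a domain name in uncompressed, lowercased wire format."""
    wire = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.lower().encode("ascii")
            wire += bytes([len(raw)]) + raw
    return wire + b"\x00"  # the zero-length root label ends the name


def dnskey_rdata(flags: int, protocol: int, algorithm: int, public_key: bytes) -> bytes:
    """Flags (16 bits), Protocol (8 bits), Algorithm (8 bits), then the key."""
    return struct.pack("!HBB", flags, protocol, algorithm) + public_key


def key_tag(rdata: bytes) -> int:
    """Checksum-style key tag (the general case; algorithm 1 differs)."""
    acc = 0
    for i, byte in enumerate(rdata):
        acc += byte if i & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def ds_digest_sha256(owner: str, rdata: bytes) -> str:
    """DS digest type 2: SHA-256 over the owner name plus the DNSKEY RDATA."""
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest().upper()


# Placeholder key material, NOT a real key: a validator would take these
# bytes from the DNSKEY RR at the child's apex.
rdata = dnskey_rdata(257, 3, 8, b"\x01\x02\x03\x04")
print(key_tag(rdata), ds_digest_sha256("example.", rdata))
```

A validator would compare the computed digest with the Digest field of the DS RR held by the parent, and use the key tag only as a hint for selecting candidate keys, since different keys can share a tag.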

18.10.2.3 Resolver Operation Example 

Given a chain of signed zones and a security-aware validating resolver, we can
see how the contents of a DNS response can be validated. In the best case, a zone
can be reached through a chain of trust from the root zone. ICANN keeps a list of
which zones have been enabled for DNSSEC by having DS records present in the
root zone and signed DNSKEY RRs [TLD-REPORT].

Figure 18-42 A zone cut for an authenticated delegated zone includes a DS RR in the parent containing a hash
of the DNSKEY RR(s) in the child. All RRsets are signed with corresponding RRSIG RRs except
the delegation NS RRs (and glue records) in the parent. NSEC RRs can be used to verify the
types present in the zone and include an SOA RR type indication at the apex in the child zone.

Assume that we wish to resolve and verify an A RR type for the domain name 
www.icann.org. Proceeding from the root downward, we shall at first require 
the root's trust anchor (i.e., DNSKEY RRs), DS records for org. contained in one 
of the root name servers, and perhaps RRSIG and NSEC (NSEC3) records. We 
then repeat the process using the org. and icann.org. domain names and
corresponding DNS servers. We begin with the root zone:


Linux% dig @a.root-servers.net. . dnskey +noquestion +nocomments \
          +nostats +multiline
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.7.2-P3 <<>> @a.root-servers.net. . dnskey
+noquestion +nocomments +nostats +multiline
; (1 server found)
;; global options: +cmd
.  86400 IN DNSKEY 257 3 8 ( AwEAAagAIKl ... ) ; key id = 19036
.  86400 IN DNSKEY 256 3 8 ( AwEAAbSgVAz ... ) ; key id = 21639
.  86400 IN DNSKEY 256 3 8 ( AwEAAcAPhPM ... ) ; key id = 40288

Here we can see the trust anchor for the root zone, which constitutes the root
of trust for all DNSSEC in the Internet. The first key is a KSK, indicated by the
value 257 (SEP bit is 1), which is the preferred one used in forming trust chains.
The others are marked as ZSKs. Next, we would like to ensure that all the records
we have just seen are supposed to be present and have appropriate signatures. The
root's RRSIG records of interest can be seen as follows:


Linux% dig @a.root-servers.net. . rrsig +noquestion +nocomments \
          +nostats +noauthority +noadditional
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.7.2-P3 <<>> @a.root-servers.net. . rrsig +noquestion
+nocomments +nostats +noauthority +noadditional
; (1 server found)
;; global options: +cmd
.  86400 IN RRSIG NSEC 8 0 86400 20101228000000 20101220230000
          40288 . RyoGBldxxX...
.  86400 IN RRSIG DNSKEY 8 0 86400 20110105235959 20101221000000
          19036 . f8bzNvPmHR...
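
In this output, dig prints the Signature Expiration and Signature Inception fields as YYYYMMDDHHmmSS strings in UTC (expiration first), whereas on the wire they are seconds since January 1, 1970. A small sketch of the conversion and the validity-window test follows. The function names are our own, and the sketch ignores the 32-bit wraparound arithmetic a full validator must apply to these fields.

```python
from datetime import datetime, timezone


def rrsig_time(stamp: str) -> int:
    """Convert a YYYYMMDDHHmmSS string (UTC) to seconds since the epoch."""
    parsed = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def signature_is_current(inception: str, expiration: str, now: int) -> bool:
    """True if 'now' falls inside the signature's validity window."""
    return rrsig_time(inception) <= now <= rrsig_time(expiration)


# Values taken from the RRSIG covering the root's DNSKEY RRset above.
inception, expiration = "20101221000000", "20110105235959"
print(rrsig_time(inception), rrsig_time(expiration))
print(signature_is_current(inception, expiration, rrsig_time("20101225120000")))
```

A signature outside its window must be treated as invalid even if the cryptographic check would succeed, which is why validators depend on a reasonably accurate clock.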


The RRSIG covering the DNSKEY record uses key tag 19036, which matches
the KSK contained in the root zone's DNSKEY RR. The root contains other RRSIG
records (for its SOA and NS records), but we are more concerned with the RRSIGs
for the DNSKEY and NSEC RRs. Just to be extra-sure that the DNSKEY RR should
be present, we can inspect the root's NSEC RR to verify that its type is present:


Linux% dig @a.root-servers.net. . nsec +noquestion +nocomments \
          +nostats +noauthority +noadditional

; <<>> DiG 9.7.2-P3 <<>> @a.root-servers.net. . nsec +noquestion
+nocomments +nostats +noauthority +noadditional
; (1 server found)
;; global options: +cmd
.  86400 IN NSEC ac. NS SOA RRSIG NSEC DNSKEY

This confirms that the root zone officially contains RRset types NS, SOA,
RRSIG, NSEC, and DNSKEY, so we are in good shape so far. (Note also that ac.
is the first TLD in the canonical ordering of the root zone.) Next we need to check
out the signatures on the delegation from the root to org.. This can be done as
follows:


Linux% dig @a.root-servers.net. org. rrsig +noquestion +nocomments \
          +nostats +noadditional +dnssec

; <<>> DiG 9.7.2-P3 <<>> @a.root-servers.net. org. rrsig +noquestion
+nocomments +nostats +noadditional +dnssec
; (1 server found)
;; global options: +cmd
org.  172800 IN NS    d0.org.afilias-nst.org.
org.  172800 IN NS    b2.org.afilias-nst.org.
org.  172800 IN NS    a0.org.afilias-nst.info.
org.  172800 IN NS    b0.org.afilias-nst.org.
org.  172800 IN NS    a2.org.afilias-nst.info.
org.  172800 IN NS    c0.org.afilias-nst.info.
org.  86400  IN DS    21366 7 2 96EEB2FFD9 ...
org.  86400  IN DS    21366 7 1 E6C1716CFB ...
org.  86400  IN RRSIG DS 8 1 86400 20101228000000 20101220230000
                      40288 . jpcJOGclwlnx9Kvz5 ...


The presence of the DS RRset and its associated RRSIG suggests that indeed 
there is a DNSSEC secured delegation. The RRSIG RR contains the key tag 40288, 
which refers to the third DNSKEY RR we saw earlier for the root zone (the ZSK). 
The NS records provide us with the names of the next servers to use in the next 
steps for our query. We can proceed by repeating the queries we made for the root, 
but this time using org.. We direct such queries at one of the servers specified in 
the NS RR for org. in the root: 


Linux% dig @d0.org.afilias-nst.org. org. dnskey +dnssec +nostats \
          +noquestion +multiline

; <<>> DiG 9.7.2-P3 <<>> @d0.org.afilias-nst.org. org. dnskey +dnssec
+nostats +noquestion +multiline
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8061
;; flags: qr aa rd; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
;; ANSWER SECTION:
org.  900 IN DNSKEY 256 3 7 ( AwEAAZTErUF ... ) ; key id = 1743
org.  900 IN DNSKEY 256 3 7 ( AwEAAazTpnm ... ) ; key id = 43172
org.  900 IN DNSKEY 257 3 7 ( AwEAAYpYfj3 ... ) ; key id = 21366
org.  900 IN DNSKEY 257 3 7 ( AwEAAZTjblO ... ) ; key id = 9795
org.  900 IN RRSIG DNSKEY 7 1 900 20101231154644
          20101217144644 21366 org.
          aIZgEsoJO+Q8ZXM ...
org.  900 IN RRSIG DNSKEY 7 1 900 20101231154644
          20101217144644 43172 org. MWWosWBdEmM8CiM ...

Here we can see that four DNSKEY RRs exist, two of which are KSKs (value 
257) and two of which are ZSKs (value 256). The third one listed (21366)
corresponds to the DS RR we found located in the root zone. The RRSIG RRs use this
key, plus the ZSK with ID 43172. To verify their presence as legitimate, we can 
look for NSEC or NSEC3 records that may be present for org.: 


Linux% dig @d0.org.afilias-nst.org. org. nsec +dnssec +nostats \
          +noquestion

; <<>> DiG 9.7.2-P3 <<>> @d0.org.afilias-nst.org. nsec org. +dnssec
+nostats +noquestion
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61632
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 4, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
;; AUTHORITY SECTION:
h9p7u7tr2u91d0v01js911gidnp90u3h.org. 86400 IN NSEC3 1 1 1
          D399EAAB
          H9RSFB7FPF2L8HG35CMPC765TDK23RP6
          NS SOA RRSIG DNSKEY NSEC3PARAM
h9p7u7tr2u91d0v01js911gidnp90u3h.org. 86400 IN RRSIG NSEC3 7 2
          86400 20110105003654
          20101221233654
          43172 org. eBtna4fok ...


Here we see an NSEC3 record with owner name equal to the hashed version
of org.. It indicates the presence of a DNSKEY and RRSIG record, as well as NS
and NSEC3PARAM records. Following the last type, we can determine the NSEC3
information:


Linux% ./dig @a0.org.afilias-nst.info. org. nsec3param +dnssec \
          +nostats +noadditional +noauthority +noquestion

; <<>> DiG 9.7.2-P3 <<>> @a0.org.afilias-nst.info. org. nsec3param
+dnssec +nostats +noadditional +noauthority +noquestion
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38602
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 7, ADDITIONAL: 13
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
;; ANSWER SECTION:
org.  900 IN NSEC3PARAM 1 0 1 D399EAAB
org.  900 IN RRSIG NSEC3PARAM 7 1 900 20101231154644
          20101217144644 43172 org. fS2kFw53elY ...


We can see that this NSEC3PARAM RR matches the NSEC3 RR because of the
match of the value D399EAAB (salt). We can also see that the signature in the
RRSIG RR came from the private key associated with DNSKEY having ID 43172.
If all signatures match, so far we have a valid chain of trust. To complete the chain,
we need information about icann.org.:


Linux% dig @a0.org.afilias-nst.info. icann.org. any +dnssec +nostats \
          +noadditional

; <<>> DiG 9.7.2-P3 <<>> @a0.org.afilias-nst.info. icann.org. any
+dnssec +nostats +noadditional
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61234
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 8, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;icann.org.            IN ANY

;; AUTHORITY SECTION:
icann.org.  86400 IN NS    a.iana-servers.net.
icann.org.  86400 IN NS    b.iana-servers.org.
icann.org.  86400 IN NS    c.iana-servers.net.
icann.org.  86400 IN NS    d.iana-servers.net.
icann.org.  86400 IN NS    ns.icann.org.
icann.org.  86400 IN DS    41643 7 1 93358DB ...
icann.org.  86400 IN DS    41643 7 2 B8AB67D ...
icann.org.  86400 IN RRSIG DS 7 2 86400 20101231154644
                     20101217144644 43172 org. cZlZ30w// ...

We can see the DS RR indicating the signed delegation for icann.org. from
org.. The RRSIG for the DS RRset is signed based on the ZSK with ID 43172.
Using one of the servers present in the NS records, we can look at the final server:


Linux% dig @a.iana-servers.net. icann.org. dnskey +dnssec +nostats \
          +noquestion +multiline

; <<>> DiG 9.7.2-P3 <<>> @a.iana-servers.net. icann.org. dnskey +dnssec
+nostats +noquestion +multiline
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22065
;; flags: qr aa rd; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; ANSWER SECTION:
icann.org.  3600 IN DNSKEY 256 3 7 ( AwEAAbDmrVc ... ) ; key id = 41295
icann.org.  3600 IN DNSKEY 256 3 7 ( AwEAAbgrYZd ... ) ; key id = 55469
icann.org.  3600 IN DNSKEY 257 3 7 ( AwEAAZuSdr4 ... ) ; key id = 7455
icann.org.  3600 IN DNSKEY 257 3 7 ( AwEAAcyguBH ... ) ; key id = 41643
icann.org.  3600 IN RRSIG DNSKEY 7 2 3600 20101229153632
            20101222042536 41643 icann.org.
            UxR/5vyOIS ...





Here we can see that four DNSKEY RRs exist: two KSKs and two ZSKs. The
fourth one listed (41643) corresponds to the DS RR we found located in the org.
zone. The RRSIG RR uses this key. To find the answer to our ultimate query, we
request the A record:


Linux% dig @a.iana-servers.net. www.icann.org. a +dnssec +nostats \
          +noquestion +noauthority +noadditional

; <<>> DiG 9.7.2-P3 <<>> @a.iana-servers.net. www.icann.org. a +dnssec
+nostats +noquestion +noauthority +noadditional
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56258
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 6, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; ANSWER SECTION:
www.icann.org.  600 IN A     192.0.32.7
www.icann.org.  600 IN RRSIG A 7 3 600 20101229143630
                20101222042536 55469 icann.org.
                YRhlL/RA ...


We have finally reached the end of the chase for the A RR for www.icann.org..
It contains the IP address 192.0.32.7, signed by an RRSIG RR using key
ID 55469. This is the key from the second DNSKEY RR we saw at the apex of the
icann.org. zone. So at this point it would appear that all is in order. However, we
have not demonstrated that all the signature values are actually correct. To do this
validation, the following command may be executed:


Linux% dig @a.root-servers.net. www.icann.org. a +sigchase +topdown \
          +trusted-key=trusted-keys


This command works if the dig program has been compiled with the
-DDIG_SIGCHASE=1 compile-time option and the file trusted-keys contains the root's
DNSKEY RRset. After many lines of output, we find that it does indicate success.
A simpler method for checking the validity can be achieved using a
DNS/DNSSEC-checking Web site such as http://dnsviz.net. Output from such a query
is shown in Figure 18-43.

Here we can see a successful validation for the A and AAAA RR types for the
domain name www.icann.org.. Each rectangle represents a zone and contains
its name and the time it was analyzed. Within each zone are ovals representing
elements in the chain of trust, either DNSKEY or DS RRs. Dashed ovals indicate
that the keys are not being used for signatures of interest. Arrows between ovals
indicate RRSIG or DS digests. Two types of algorithms are represented. In the root
zone, "alg = 8" indicates that RSA/SHA-256 [RFC5702] signatures are in use. In
other zones, "alg = 7" indicates RSA/SHA-1 that permits the use of NSEC3 records
[RFC5155]. For the DS RR in the root zone, "digest algs = 1,2" indicates that SHA-1
[RFC4034] and SHA-256 [RFC4509] are supported.
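
The hashed owner name in the NSEC3 record for org. shown earlier in this example can, in principle, be recomputed from the NSEC3PARAM values (hash algorithm 1, one additional iteration, salt D399EAAB). The sketch below assumes the iterated SHA-1 construction and Base32 "extended hex" encoding specified in [RFC5155]; the function names are our own.

```python
import base64
import hashlib

# Translate the standard Base32 alphabet into the "extended hex" alphabet.
_B32_TO_B32HEX = bytes.maketrans(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567",
    b"0123456789ABCDEFGHIJKLMNOPQRSTUV",
)


def name_to_wire(name: str) -> bytes:
    """Encode a domain name in uncompressed, lowercased wire format."""
    wire = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.lower().encode("ascii")
            wire += bytes([len(raw)]) + raw
    return wire + b"\x00"


def nsec3_hash(name: str, salt_hex: str, iterations: int) -> str:
    """Hash an owner name; 'iterations' counts the additional hash rounds."""
    salt = bytes.fromhex(salt_hex)
    digest = hashlib.sha1(name_to_wire(name) + salt).digest()
    for _ in range(iterations):
        digest = hashlib.sha1(digest + salt).digest()
    # A 20-byte SHA-1 digest encodes to exactly 32 characters, no padding.
    encoded = base64.b32encode(digest).translate(_B32_TO_B32HEX)
    return encoded.decode("ascii").lower()


# Parameters from the NSEC3PARAM RR for org. in the example.
print(nsec3_hash("org.", "D399EAAB", 1) + ".org.")
```

The printed name can be compared with the owner name of the NSEC3 RR returned by the org. server. The salt and iteration count make precomputed dictionaries of hashed names more expensive to build, but they are public values and provide no secrecy on their own.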

18.10.3 Transaction Authentication (TSIG, TKEY, and SIG(0))

Some transactions in DNS, such as zone transfers and dynamic updates, could
compromise the DNS structure or contents if improperly used. Consequently,
they require some form of authentication. Even conventional DNS resolution may
require authentication if a resolver expects to depend on validated DNS resolutions
but does not implement full DNSSEC processing. With transaction authentication,
the exchange between a particular resolver and server (or between servers)
is protected. Note, however, that transactional security does not directly protect
the contents of the DNS, as does DNSSEC. As a result, DNSSEC and transaction
authentication are complementary and can be deployed together. DNSSEC provides
data origin authentication and integrity of zone data, while transaction
authentication provides integrity and authentication for a particular transaction
between a client and a server without checking the correctness of the content
being exchanged.

There are two primary methods for authenticating DNS transactions: TSIG
and SIG(0). TSIG uses shared keys and SIG(0) uses public/private key pairs. To
help ease the burden of deployment, a TKEY RR type can be used to help form
keys (e.g., by holding public DH values) for either TSIG or SIG(0). We will begin
by discussing TSIG, the more common of the transaction security mechanisms.

18.10.3.1 TSIG 

Secret Key Transaction Authentication for DNS or Transaction Signatures (TSIG)
[RFC2845] adds transactional authentication for DNS exchanges using signatures
based on shared secret keys. TSIG makes use of a TSIG pseudo-RR that is computed
on demand and is used only to secure a single transaction. The format of
the RDATA portion of a TSIG pseudo-RR is shown in Figure 18-44.

Such RRs are sent in the additional data section of a DNS request or response.
The original MAC algorithm specified in [RFC2845] was based on HMAC-MD5, but
newer GSS-API (Kerberos) [RFC3645] and SHA-1- and SHA-256-based algorithms
have since been specified in [RFC4635]; the current list is available at [TSIGALG].
The algorithm names were envisioned to be encoded as domain names (e.g.,
HMAC-MD5.SIG-ALG.REG.INT), but now most use descriptive strings (e.g.,
hmac-sha1, hmac-sha256). The 48-bit Time Signed field is in UNIX time format
(seconds since January 1, 1970, UTC) and gives the time the message contents
were signed. This field is covered in the digital signature and is designed to detect
and prevent replay attacks. The consequence of using an absolute time here is that
peers using TSIG must agree on the time to within the number of seconds specified
by the Fudge field. The MAC Size field gives the number of bytes required to
contain the MAC in the MAC field and depends on the particular MAC algorithm.
The Other Length field gives the size of the Other Data field in bytes, which is
used only in carrying error messages.

[Figure 18-44 field layout, in order: Algorithm Name (encoded as domain name,
variable); Time Signed (48 bits); Fudge (16 bits); MAC Size (16 bits); MAC
(variable); Original ID (16 bits); Error (16 bits); Other Length (16 bits);
Other Data (variable, used for errors)]

Figure 18-44 The TSIG pseudo-RR RDATA area contains a signature algorithm ID, signature time
and time fudge factor, and a MAC. Originally, only an MD5-based signature was
used, but now SHA-1- and SHA-2-based signatures have been standardized. TSIG
peers must be time-synchronized to within the number of seconds in the Fudge field.
TSIG RRs are carried in the additional data section of a DNS message.
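
The role of the Time Signed and Fudge fields can be illustrated with a short sketch. The function names are our own. The toy_mac helper only demonstrates that HMAC-MD5 yields a 16-byte MAC; it does not assemble the full digest input that [RFC2845] prescribes for a real TSIG signature.

```python
import hashlib
import hmac
import struct


def pack_time_signed(seconds: int, fudge: int) -> bytes:
    """Encode Time Signed (48 bits) followed by Fudge (16 bits), big-endian."""
    return struct.pack("!HIH", seconds >> 32, seconds & 0xFFFFFFFF, fudge)


def within_fudge(time_signed: int, fudge: int, now: int) -> bool:
    """A receiver accepts the message only if its clock is close enough."""
    return abs(now - time_signed) <= fudge


def toy_mac(key: bytes, data: bytes) -> bytes:
    """HMAC-MD5 over caller-supplied bytes (NOT the complete TSIG input)."""
    return hmac.new(key, data, hashlib.md5).digest()


signed_at = 1293055558  # an arbitrary example timestamp
print(pack_time_signed(signed_at, 300).hex())
print(within_fudge(signed_at, 300, signed_at + 120))
print(len(toy_mac(b"hypothetical-shared-secret", b"message bytes")))
```

Because the timestamp is part of the signed data, an attacker who replays a captured message after the fudge window has passed cannot adjust the time without invalidating the MAC.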

To see TSIG in action, we can construct a sample zone called dynzone. and 
perform a signed dynamic update. We use the nsupdate program supplied with 
BIND9 to perform the update: 


Linux% nsupdate 

> zone dynzone. 

> server 127.0.0.1 

> key tsigkey.dynzone. 1234567890abcdef 

> update delete two.dynzone. 

> send 


This series of instructions forms a DNS update message signed using TSIG 
that is sent to the server once the send instruction is issued. The request is shown 
in Figure 18-45. 

In this figure, a dynamic DNS update request has been signed using the
HMAC-MD5 signature algorithm. The signing key's name is tsigkey.dynzone..
The request is to update the zone dynzone. by removing the entry two.dynzone..
The name of the signature algorithm is HMAC-MD5.SIG-ALG.REG.INT, which is
the only signature algorithm supported by this particular software package. Note
that the Original ID field (15746 decimal) matches the value of the Transaction ID
field (0x3d82). The response confirms that the update was successful, as shown in
Figure 18-46.

[Figure 18-45: Wireshark capture of the request. Frame 1, 158 bytes, UDP from
127.0.0.1 port 10072 to 127.0.0.1 port 53. Transaction ID 0x3d82; Flags 0x2800
(dynamic update, opcode 5); Zones 1, Prerequisites 0, Updates 1, Additional RRs 1.
Zone: dynzone, type SOA, class IN. Update: two.dynzone, type ANY, class ANY,
TTL 0, data length 0. Additional record: tsigkey.dynzone, type TSIG, class ANY,
TTL 0, data length 58, Algorithm Name hmac-md5.sig-alg.reg.int, Time Signed
Dec 22, 2010 14:05:58 Pacific Standard Time, Fudge 300, MAC Size 16,
Original ID 15746, Error 0 (no error), Other Len 0.]

Figure 18-45 A DNS dynamic update signed using TSIG. The request is to delete the RR for
two.dynzone.. The request is signed using the key with name tsigkey.dynzone.. The
signature algorithm is HMAC-MD5, which produces a 128-bit (16-byte) signature.

Figure 18-46 shows a successful response to a DNS dynamic update request
signed using TSIG. The Flags field indicates that a dynamic update response
contains no errors. Once again, the TSIG pseudo-RR is contained in the additional
information area.

[Figure 18-46: Wireshark capture of the response. Frame 2, 158 bytes, UDP from
127.0.0.1 port 53 to 127.0.0.1 port 10072, 0.044373 seconds after the request.
Transaction ID 0x3d82; Flags 0xa880 (dynamic update response, opcode 5,
recursion available set, reply code 0, no error); Zones 1, Prerequisites 0,
Updates 1, Additional RRs 1. Zone: dynzone, type SOA, class IN. Update:
two.dynzone, type ANY, class ANY, TTL 0, data length 0. Additional record:
tsigkey.dynzone, type TSIG, class ANY, TTL 0, data length 58, Algorithm Name
hmac-md5.sig-alg.reg.int, Time Signed Dec 22, 2010 14:05:58 Pacific Standard
Time, Fudge 300, MAC Size 16, Original ID 15746, Error 0 (no error), Other Len 0.]

Figure 18-46 A DNS dynamic update response signed using TSIG. The RRset two.dynzone. has
been successfully removed using dynamic update.
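
The Flags values reported in the two captures (0x2800 for the request, 0xa880 for the response) can be unpacked with the conventional DNS header bit layout. The following sketch uses our own function name and uses the query-style names for the bit positions, as the capture tool does, even though some of these bits carry no meaning in update messages.

```python
def decode_flags(flags: int) -> dict:
    """Split the 16-bit DNS header Flags field into its main components."""
    return {
        "qr": (flags >> 15) & 0x1,      # 0 = request, 1 = response
        "opcode": (flags >> 11) & 0xF,  # 5 = dynamic update
        "aa": (flags >> 10) & 0x1,
        "tc": (flags >> 9) & 0x1,
        "rd": (flags >> 8) & 0x1,
        "ra": (flags >> 7) & 0x1,
        "rcode": flags & 0xF,           # 0 = no error
    }


print(decode_flags(0x2800))  # the signed update request
print(decode_flags(0xA880))  # the signed update response
print(0x3D82)                # Transaction ID as decimal, cf. Original ID
```

Both messages carry opcode 5, and the response differs from the request only in the QR and RA bits. The Transaction ID 0x3d82 is 15746 in decimal, the same value carried in the TSIG Original ID field.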


18.10.3.2 SIG(0)

Early versions of DNSSEC included signature (SIG) resource records that correspond
to the modern RRSIG RRs discussed previously. However, a particular kind
of SIG RR called SIG(0) [RFC2931] does not cover static records in the DNS but
instead is generated dynamically for transactions. The 0 part of SIG(0) refers to the
length of data within an RR covered by the signature. As a result, SIG(0) records
can in principle be used instead of TSIG RRs to achieve the same result. However,
they are implemented in different ways. Most importantly, SIG(0) places its basis
of trust in public keys instead of shared keys. SIG(0) appears to be shrinking in
popularity in favor of TSIG, so we do not discuss it further.

18.10.3.3 TKEY 

The TKEY meta-RR type is intended to simplify the deployment of DNS transaction
security such as TSIG and SIG(0) [RFC2930]. To do this, TKEY RRs are dynamically
created and sent in the additional information section of DNS requests and
responses. They can contain either keys or material used to form keys such as DH
public values. It may be useful in local deployments but is not in widespread use.

18.10.4 DNSSEC with DNS64 

In Chapter 11 we described DNS64, which translates IPv6 DNS requests into IPv4
DNS requests and can synthesize AAAA records based on A records found in the
IPv4 DNS. The scheme is useful for allowing IPv6-only hosts to access IPv4 servers
and services. DNS64 works by synthesizing AAAA records. With DNSSEC, however,
DNS RRs need to be signed by the signing authority (typically the domain
name owner or zone administrator). This presents a challenge: How can DNS64
synthesize RRs if it lacks the keys to produce DNSSEC-compatible signatures? The
answer is, essentially, that it does not (see Sections 5.5 and 6.2 in [RFC6147]).

To operate DNS64 in conjunction with DNSSEC, the validation function
is performed either in the host (where DNS64 could be implemented) or by the
DNS64 device, assuming there exists a secure channel between a stub resolver
and the DNS64 acting as a recursive name server. A validating DNS64 is known as
vDNS64. A vDNS64 interprets the CD and DO bits in an incoming query. If neither
is set, the vDNS64 performs synthesis and validation but does not set the AD bit
in the (validated) response. If the DO bit is set and the CD bit is not, the vDNS64
performs validation and synthesis and returns a validated response with the AD
bit set (which the client presumably interprets as meaning that the returned RRs
are authentic). Note that the DNS64 first requests AAAA records on the IPv4 side
and synthesizes AAAA records only when it can validate that no AAAA records with
the same owner exist. If both the DO and CD bits are set, the DNS64 may perform
validation but not synthesis. In this case, it is presumed that the client will perform
validation. This case represents a potential problem because if the client is
security-aware but translation-oblivious, the returned RRs will probably not be
usable in the IPv6 addressing realm.
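
The three cases just described can be summarized as a small decision function. This is a sketch of the behavior as stated in the text, with names of our own choosing. The combination of CD set without DO is not covered above, so the sketch declines to guess at it.

```python
from typing import NamedTuple


class Dns64Action(NamedTuple):
    validate: bool    # the vDNS64 itself validates the answer
    synthesize: bool  # AAAA records may be synthesized from A records
    set_ad: bool      # the AD bit is set in the response


def vdns64_action(do_bit: bool, cd_bit: bool) -> Dns64Action:
    """Map the DO and CD bits of an incoming query to vDNS64 behavior."""
    if not do_bit and not cd_bit:
        return Dns64Action(validate=True, synthesize=True, set_ad=False)
    if do_bit and not cd_bit:
        return Dns64Action(validate=True, synthesize=True, set_ad=True)
    if do_bit and cd_bit:
        # Validation here is optional ("may"); it is modeled as left to
        # the client, and no synthesis takes place.
        return Dns64Action(validate=False, synthesize=False, set_ad=False)
    raise ValueError("CD set without DO is not described in the text")


for do_bit, cd_bit in ((False, False), (True, False), (True, True)):
    print(do_bit, cd_bit, vdns64_action(do_bit, cd_bit))
```

The last handled case is the troublesome one: a client that sets both bits receives unsynthesized data, which is of little use to an IPv6-only host that is unaware of the translation.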


18.11 DomainKeys Identified Mail (DKIM)

DomainKeys Identified Mail (DKIM) [RFC5585] is intended to provide an association
between an entity and a domain name that can be used to help determine
the party responsible for originating a message, especially in the e-mail context.
It provides a method to help authenticate the signer of a message, which is not
necessarily the sender, and this can be used in helping to fight spam at the e-mail
distribution level (i.e., between mail agents). This is accomplished by adding a
DKIM-Signature field to the basic Internet message format [RFC5322]. This field
contains a digital signature of the header and body of the message. DKIM replaces
an earlier standard called DomainKeys, which uses the DomainKey-Signature field.

18.11.1 DKIM Signatures 

To produce a digital signature for a message, a Signing Domain Identifier (SDID) 
uses RSA/SHA-1 or RSA/SHA-256 and an associated private key. SDIDs are 
domain names from the DNS and are used to retrieve public keys stored as TXT 
RRs. A DKIM signature is encoded as a message header field using Base64 (such 
as PEM) that signs an explicitly listed set of message fields and the message body. 
When receiving an e-mail, for example, a mail transfer agent uses the SDID to 
perform a DNS query to find the corresponding public key, which it then uses to 
verify the signature. This avoids requiring a PKI. The owning domain name is 
constructed from the domain itself along with the selector (public key selector). For 
example, the public key for the selector key35 in domain example.com would be
a TXT RR owned by key35._domainkey.example.com. 

The DKIM-Signature field [RFC6376] is added to a message header and may 
contain several subfields (see [DKPARAMS] for the complete list). The operation 
of DKIM is conceptually similar to the DNS Sender Policy Framework (SPF; see 
Chapter 11) but is stronger because of the cryptographic digital signature. DKIM 
and SPF can be used together. 

DKIM-enabled domains may elect to participate in Author Domain Signing 
Practices (ADSP) [RFC5617]. ADSP involves the creation of a machine-readable 
signing practices statement for a domain. Such records are placed in the DNS using 
TXT RRs with owner name equal to _adsp._domainkey.domain.. At present 
ADSP records are simple and indicate only how the authoring domain uses DKIM 
signatures. The values may be unknown, all, or discardable. These are really 
hints as to what a receiving agent might do with a received message. The value 
unknown indicates no particular statement, all indicates that the author signs 
all messages but unsigned ones may still be worthwhile, and discardable indicates
that unsigned messages should be considered subject to discarding;
discardable is the most stringent level.

18.11.2 Example 

To get an idea of how a DKIM signature appears in an e-mail, we can simply 
extract the DKIM-Signature field from an e-mail message generated from a large 
e-mail provider such as Google's Gmail: 

DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
  d=gmail.com; s=gamma;
  h=domainkey-signature:mime-version:received:sender:received:date
  :x-google-sender-auth:message-id:subject:from:to:content-type;
  bh=PU2XIErWsXvhvtlW96ntPWZ2VImjVZ3vBY2T/A+wA3A=;
  b=WneQe6kpeu/BfMfa2RSlAlITvYKfIKmoQRXNc
  IQJDIVoE38+fGDajOuhNmSvXp/8kJ
  I8HqtkV4/P6/QVPMN+/SbSSdsnlhzOS/YoP
  bZx0Lt2bD67G4HPsvm6eLsaIC9rQECUSL
  MdaTBK3BgFhYo3nenq3+8GxTe91+zBcqWAVPU=


This indicates a version 1 signature and digest algorithm of SHA-256 signed
using RSA. The header and body canonicalization algorithms are both "relaxed," as
shown by the c= field. Canonicalization algorithms are used to rewrite messages
in a consistent form. The current options are "simple" (the default), which does not
alter the text, and "relaxed," which can rewrite the input in common ways such
as altering whitespace and wrapping long header lines. The selector (s=) is called
gamma and the domain (d=) is gmail.com. We shall use these later to retrieve
the appropriate public key. The header fields used in computing the signature
(indicated by h=) include domainkey-signature (predecessor to DKIM), version
of MIME, received, sender, date, x-google-sender-auth, message-id,
subject, from, to, and content-type. The bh= subfield indicates the hash value
of the message body expressed in Base64. The b= value contains the RSA signature
on the hash of the headers listed in the h= subfield.
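
The subfields discussed above follow a simple tag=value layout separated by semicolons, which the following Python sketch splits apart. The function names are hypothetical and the parsing is deliberately minimal (no validation, no signature verification); it only shows how d=, s=, h=, c=, bh=, and b= can be pulled out of the field value.

```python
def parse_tag_list(value: str) -> dict:
    """Split a 'tag=value; tag=value' string into a dictionary."""
    tags = {}
    for item in value.split(";"):
        name, sep, val = item.partition("=")  # split on the first '=' only
        if sep:
            tags[name.strip()] = val.strip()
    return tags


def parse_dkim_signature(value: str) -> dict:
    """Return the subfields of a DKIM-Signature value in a convenient form."""
    fields = parse_tag_list(value)
    if "h" in fields:
        # h= is a colon-separated list of signed header field names.
        fields["h"] = [f.strip().lower() for f in fields["h"].split(":")]
    for key in ("b", "bh"):
        if key in fields:
            # Base64 values may be folded across lines; drop the whitespace.
            fields[key] = "".join(fields[key].split())
    if "c" in fields:
        header_alg, _, body_alg = fields["c"].partition("/")
        # If only one algorithm is named, the body falls back to "simple".
        fields["c"] = (header_alg, body_alg or "simple")
    return fields


example = ("v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; "
           "h=mime-version:from : to; bh=PU2X IErW=; b=Wne Qe6=")
sig = parse_dkim_signature(example)
print(sig["s"], sig["d"], sig["h"])
```

The bh= and b= strings in this example are shortened placeholders rather than real digests; the selector and domain recovered here are the two values needed to form the DNS query shown next.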

To retrieve the public key to validate the signature, we can form the following
query:


Linux% dig gamma._domainkey.gmail.com. txt +nostats +noquestion

; <<>> DiG 9.7.2-P3 <<>> gamma._domainkey.gmail.com. txt
+nostats +noquestion
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17372
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; ANSWER SECTION:
gamma._domainkey.gmail.com. 296 IN TXT "k=rsa\; t=y\; p=MIGfMA0GCS
qGSIb3DQEBAQUAA4GNADCBiQKBgQDIhYR3oItOY22ZOaBrIVe9m/iME3RqOJeasANSpg2YTHTYV
+Xtp4xwf5gTjCmHQEMOs0qYu0FYiNQPQogJ2t0Mfx9zNu06rfRBDjiIU9tpx2T+NGlWZ8qhbiLo
5BY8apJavLYqTLavYPSrvsxOB3YzC63T4Age2CDqZYA+OwSMWQIDAQAB"


This result indicates that the key is an RSA public key. The t=y entry denotes
that the domain is testing DKIM, meaning that the results of any DKIM validation
should not ultimately affect the message delivery process. To see an example of an
ADSP, we can execute the following command:

Linux% host -t txt _adsp._domainkey.paypal.com. 

_adsp._domainkey.paypal.com descriptive text "dkim=discardable" 

Here we can see that PayPal has elected to use the most stringent DKIM signing
policy, suggesting that messages failing DKIM validation should be subject to
being discarded. The use of ADSP statements at present is fairly rare because of 
the wide variety of e-mail systems and the ways that various mail agents rewrite 
messages. 


18.12 Attacks on Security Protocols 

Attacks on security protocols are somewhat different from the attacks on protocols 
we have seen in other chapters. Attacks discussed in other chapters tend to compromise
some protocol that was never really designed with security in mind by
taking advantage of some design or implementation flaw. Attacks against security 
protocols not only take these forms but may also involve cryptographic attacks 
that somehow subvert the mathematical basis upon which the security depends. 
Attacks can be successful against poor algorithms, weak or too-short keys, or 
poor combinations of various components that render an otherwise secure system 
much weaker. (A classic and fascinating example can be seen in the cryptanalysis 
of the VENONA system [VENONA].) 

To understand some of the types of attacks targeting security protocols, we 
will begin from the lowest layer and work our way up. A number of attacks have 
been waged against 802.11 and EAP. Early security in 802.11 (e.g., WEP and WPA-TKIP)
has been shown to be easily compromised cryptographically [TWP07]
[OM09], and WPA2-AES is believed to be substantially more resilient, although
use of poorly selected pre-shared keys (PSKs) can represent a significant vulnerability
to dictionary attacks.

EAP does not have its own authentication method but can inherit vulnerabilities
of the authentication methods on which it depends. Once again, systems
based on EAP using keys derived from user passwords (e.g., EAP-GSS, EAP-LEAP,
EAP-SIM) are often vulnerable to dictionary attacks. 802.1X/EAP is vulnerable
to MITM attacks involving tunneled authentication protocols as discussed in
[ANN02]. The problem relates to deriving a session key after only one side of a
two-party connection has been authenticated. For example, if a server authenticates
to a client and this exchange is used as the basis to form a tunnel secured
by a derived session key where another protocol that authenticates in the reverse
direction operates inside, a MITM attack involving impersonation of the legitimate
client becomes possible.

A number of attacks have been published against IPsec, including a class of 
attacks that exploit the use of encryption without integrity protection [PY06], a 
configuration option supported but discouraged by the IPsec documentation. In 
essence, the ability to modify the ciphertext undetected using a bit flipping attack 
can cause encrypted datagrams to be decrypted into datagrams that have been 
corrupted in predictable ways. For example, a tunnel mode ESP datagram with
its bits flipped appropriately may decrypt to a datagram with an artificially
increased Internet Header Length (IHL) field that causes the payload to be processed
as (invalid) IP options, ultimately generating an ICMP message that may be of use 
to an attacker. 
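
The following Python sketch illustrates the bit flipping idea on a toy scale. It is not ESP: it assumes a simple XOR stream cipher whose keystream is derived from SHA-256 (an illustrative construction, not one used by IPsec), and the key and datagram bytes are made up. The point is only that, without integrity protection, an attacker who can guess a plaintext byte can change it to a chosen value without knowing the key.

```python
import hashlib


def keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 over key plus a counter (illustration only)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]


def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XORing data with the keystream."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))


key = b"secret shared by the two peers"
# First byte 0x45 encodes IPv4 version 4 and IHL 5 (a 20-byte header).
inner = bytes([0x45, 0x00, 0x00, 0x1C]) + b"rest of inner datagram"
ciphertext = xor_cipher(key, inner)

# The attacker guesses the first plaintext byte is 0x45 and wants 0x4F
# (IHL 15), so XORs the ciphertext byte with the difference of the two.
tampered = bytearray(ciphertext)
tampered[0] ^= 0x45 ^ 0x4F

received = xor_cipher(key, bytes(tampered))
print(hex(received[0]))      # first byte after decryption
print(received[0] & 0x0F)    # IHL as seen by the receiver
```

Only the targeted byte changes after decryption; the rest of the datagram decrypts normally, which is why the receiver has no way to notice the modification unless a MAC is also checked.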

At the transport layer, SSL 2.0 was shown to be vulnerable to a cipher suite 
rollback attack, in which a MITM could cause each end of an SSL connection to 
conclude that the peer is capable of only weak encryption. Doing so causes the 
peers to adopt an insecure cipher suite, which the attacker can exploit. A more
sophisticated attack on SSL/TLS took advantage of the order of operations performed
at a receiver: decrypt, remove padding, and check MAC. If the padding
length or MAC is incorrect, an SSL error message is generated. By observing the
timing of these error messages, it was possible to create a padding oracle [CHVV03]
to recover plaintext from OpenSSL. A padding oracle tells whether the plaintext
used to create a ciphertext had a valid amount of padding. As mentioned previously,
a more recent attack (on TLS 1.2) involves a MITM attack whereby a prefix of
arbitrary length is injected into a TLS association, which is then renegotiated (but 
continued) when a legitimate client arrives [RD09]. The solution involves binding 
the previous channel parameters to the subsequent channel parameters using a 
TLS extension. The issue of channel binding and security is covered more broadly 
in [RFC5056]. 
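
To make the padding oracle idea more concrete, the sketch below shows only the yes/no question such an oracle answers, using the TLS CBC padding convention in which the final byte gives the padding length and every padding byte carries that same value. The function name is hypothetical, and this is not an attack implementation: a real attack infers this answer indirectly (for example, from error timing) and repeats the query many times.

```python
def tls_padding_is_valid(plaintext: bytes) -> bool:
    """Check TLS-style CBC padding on an already decrypted record.

    The last byte holds the padding length n, and it must be preceded
    by n further bytes that all have the value n.
    """
    if not plaintext:
        return False
    pad_len = plaintext[-1]
    if pad_len + 1 > len(plaintext):
        return False
    return all(b == pad_len for b in plaintext[-(pad_len + 1):])


print(tls_padding_is_valid(b"data" + b"\x03" * 4))
print(tls_padding_is_valid(b"data\x03\x03\x02\x03"))
```

A receiver that reveals the result of this check separately from the MAC check, even through timing alone, gives an attacker the oracle described above.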

Securing the DNS has been a long time coming, but the importance was 
underscored by the Kaminsky cache poisoning attack we described in Chapter 11. 
One of the original problems was the enumeration attack made available (actually 
required) by the use of NSEC records and countered by the use of NSEC3 records, 
if used properly [BM09]. At the end of 2009, Dan Bernstein mentioned a number 
of problems with DNSSEC in his keynote talk at a workshop [B09]: it can be used 
as a basis for amplification of DoS attacks, it leaks zone data even with NSEC3, 
its implementations contain exploitable bugs, signatures cannot be revoked, the 
cryptography may be subject to cryptanalysis, and some NS and A records pose 
vulnerabilities. At the time of writing, the root zone has been signed only recently, 
and few organizations have fully adopted DNSSEC. It is therefore likely that a 
variety of improvements and modifications will be implemented in the years to 
come. 


18.13 Summary 

The subject of security is broad and interesting, and we have only scraped the 
surface in this chapter. We desire several important properties of communication
security, and typically these consist of some combination of confidentiality,
authentication, integrity, and nonrepudiation. Cryptography is our most important
tool for achieving these information security properties. It involves a set of
algorithms and keys. The two most important forms are symmetric or "secret 
key" cryptography, which has good computational performance but requires keys 
to be kept secret, and public key (asymmetric key) cryptography whereby each 
principal has a key pair and one key is made public. Public key cryptography
supports both authentication and confidentiality and can be combined with symmetric
key cryptography for better performance. Other algorithms that involve
mathematics closely related to cryptography include Diffie-Hellman key agreement
used to establish symmetric keys, pseudorandom functions for selecting
random components to form keys, and MACs used to check message integrity. 
Protocols that use random nonces attempt to ensure freshness and resist replay 
attacks by requiring queries and responses to hold a common recently generated 
value. Salt (in the cryptographic sense) is used to perturb algorithms or input to 
algorithms in order to make dictionary attacks more difficult to mount. 
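
As a small illustration of salting, the following Python sketch derives a password hash using a random salt. The function names, the 16-byte salt length, and the iteration count are assumptions chosen for the example, and PBKDF2 with HMAC-SHA-256 is used only because it is available in the standard library.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple


def hash_password(password: str, salt: Optional[bytes] = None,
                  iterations: int = 100_000) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is chosen if none is given."""
    if salt is None:
        salt = os.urandom(16)  # assumed salt length for this sketch
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = 100_000) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)


salt1, digest1 = hash_password("hunter2")
salt2, digest2 = hash_password("hunter2")
print(digest1 != digest2)                          # same password, different salts
print(verify_password("hunter2", salt1, digest1))
```

Because each stored digest depends on its own salt, a precomputed dictionary of hashes cannot be reused across entries; the attacker must repeat the work for every salt.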

When relying on a public key, we ordinarily want the public key to be signed 
or authenticated by some entity or group that we trust. A public key infrastructure 
or PKI that involves one or more certificate authorities is commonly used for this 
purpose, but web of trust models are also available. The most common format for 
holding PKI public keys (and other material) is based on the ITU-T X.509 standard 
for PKI and certificates. Certificates are usually signed recursively forming a tree, 
culminating at some top-level root of trust or trust anchor. Certificates must be
validated to ensure that the trust chain is unbroken and that no chain element
has been revoked. Certificate status can be
evaluated using widely distributed certificate revocation lists (CRLs) or using an
online protocol such as OCSP. The entire certificate validation process can also
be delegated to another party using SCVP, a protocol developed for this specific 
purpose. 

There are a variety of file formats for holding certificates and keys. The DER 
or CER format is a binary encoding based on ASN.1. The PEM format expresses
the DER encoding in ASCII, so such files are easily edited and inspected. The
PKCS#12 (successor to Microsoft's PFX) format can hold both certificates and private
keys and is ordinarily encrypted for protection of the private key material. A
variety of programs such as openssl are capable of converting between formats.
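
The relationship between DER and PEM can be sketched with nothing more than Base64. The helper names below are hypothetical, the sample bytes are a stand-in rather than a real certificate, and the sketch handles only the CERTIFICATE label with 64-character lines; a tool such as openssl covers many more cases.

```python
import base64

PEM_HEADER = "-----BEGIN CERTIFICATE-----"
PEM_FOOTER = "-----END CERTIFICATE-----"


def der_to_pem(der: bytes) -> str:
    """Wrap binary DER bytes in an ASCII PEM envelope (64-character lines)."""
    b64 = base64.b64encode(der).decode("ascii")
    lines = [b64[i:i + 64] for i in range(0, len(b64), 64)]
    return "\n".join([PEM_HEADER, *lines, PEM_FOOTER]) + "\n"


def pem_to_der(pem: str) -> bytes:
    """Strip the PEM envelope and decode the Base64 body back to DER bytes."""
    lines = [ln.strip() for ln in pem.strip().splitlines()]
    if len(lines) < 2 or lines[0] != PEM_HEADER or lines[-1] != PEM_FOOTER:
        raise ValueError("not a PEM-encoded certificate")
    return base64.b64decode("".join(lines[1:-1]))


sample_der = bytes(range(256))   # stand-in bytes, not a real certificate
pem_text = der_to_pem(sample_der)
print(pem_text.splitlines()[0])
print(pem_to_der(pem_text) == sample_der)
```

The round trip shows that PEM adds no information beyond the DER bytes; it only makes them safe to paste into text files and e-mail.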

There are security protocols at every protocol layer, and some between layers.
Working from layer 2 up, some link technologies include their own encryption
and authentication protocols, although these are not ordinarily considered
TCP/IP protocols. In TCP/IP, EAP is used to establish authentication with a wide
variety of mechanisms such as machine certificates, user certificates, smart cards, 
passwords, and so on. EAP is most often used in enterprise settings that have a 
backend authorization or AAA server. EAP can also be used for authentication in 
other protocols such as IPsec. 

IPsec is a collection of protocols that provide security at layer 3: IKE, AH,
and ESP. IKE establishes and manages security associations between two parties.
Security associations can involve authentication (AH) or encryption (ESP) and can
operate in either transport or tunnel mode. In transport mode, the IP header is
modified for authentication or encryption, while in tunnel mode an IP datagram
in its entirety is placed inside a new IP datagram. ESP is the most popular. All IPsec
protocols can use different algorithms and parameters (cryptographic suites) for
encryption, integrity protection, DH key agreement, and authentication.

Moving up the stack, transport-layer security (current version TLS 1.2) protects
information moved between applications. It has its own internal layering consisting
of a record-layer protocol and three handshaking protocols called the Cipher
Change protocol, Alert protocol, and Handshake protocol. In addition, the Record
protocol supports application data. The record layer is responsible for encrypting
and integrity-protecting data based on parameters supplied by the Handshake
protocol. The Cipher Change protocol is invoked to change from a previously set-up
pending protocol state to an active protocol state. The Alert protocol indicates
errors or connection problems. TLS with TCP/IP is the most widely used security
protocol and supports encrypted Web browser connections (HTTPS). A variant of
TLS called DTLS adapts TLS for use with datagrams and protocols such as UDP
and DCCP.

To help secure host names and the Web better, DNSSEC is targeted at providing
security for the DNS. On July 15, 2010, the Internet's signed root zone was
put into operation, satisfying a prerequisite for worldwide deployment. DNSSEC
works by employing several new resource records in the DNS: DNSKEY, DS,
NSEC/NSEC3/NSEC3PARAM, and RRSIG. The first two hold and refer to public
keys used for signing the structure and contents of a zone. The NSEC or NSEC3/
NSEC3PARAM records help provide a canonical ordering of names and list of
types present for a domain name. This allows a query to reliably determine the
nonexistence of a domain name or presence of a particular type for a particular
domain name. RRSIG records hold signatures on other records, and for a zone
to be signed, all authoritative RRs within the zone must have associated RRSIG
RRs. Once set up, security of DNS queries is checked by a validating resolver or
name server that requires a trust anchor. Such systems check to ensure that digital
signatures match the public keys supplied by the DNS. This allows for errors
to be generated when some record is found to be inconsistent, and it is hoped
it can thwart domain name hijacking attacks in which attackers masquerade as
legitimate hosts. In some cases, DNS transactions are also secured. The TSIG and
SIG(0) protocols provide a form of channel authentication, but only in the scope of
DNS transactions. These protocols are used for transactions such as DNS dynamic
updates and zone transfers.

Attacks on security protocols include not only the common exploitation of
implementation bugs and insecure designs but also mathematical compromises
and "side channel" attacks that are used to discover secret information (e.g., bits of
keys). Over the years it has become clear that flexibility is needed in the strength
of the cryptography used to secure communications, so most of the protocols we
have discussed provide for cryptographic suites that can evolve as computational
power improves and additional experience is gained. Many seemingly secure protocols,
even those that have received extensive scrutiny by experts, have fallen
prey to an energetic set of analysts who seek exploitable flaws, especially when
MITM and other active attacks are possible. Extreme care is required in designing
new security protocols and operating existing protocols in a secure fashion.


3GPP 3rd Generation Partnership Project (cellular SDO responsible for GSM,
W-CDMA, LTE, etc.)

3GPP2 3rd Generation Partnership Project 2 (cellular SDO responsible for
CDMA2000, EV-DO, etc.)

6rd IPv6 Rapid Deployment (an IPv6 transition mechanism in which IPv6 traffic
is carried over IPv4 networks, similar to 6to4 but using IPv6 prefix assignments
based on unicast address assignments)

6to4 Six to Four (carrying IPv6 traffic in IPv4 tunnels; some operational challenges
have occurred)

A Address (IPv4) (DNS RR carrying an IPv4 address)

AAA Authentication, Authorization and Accounting (management capabilities
associated with certain access protocols such as RADIUS and Diameter)

AAAA Address (IPv6) (DNS RR carrying an IPv6 address)

ABC Appropriate Byte Counting (in TCP congestion control, a method to
account for the number of bytes ACKed instead of a constant factor when performing
CWND computations; can mitigate the slow window growth associated
with delayed ACKs)

AC Attribute Certificate (a type of certificate used to carry attributes such as
authorizations, but does not include a public key and therefore differs from a
PKC)

ACCM Asynchronous Control Character Map (in PPP, indicates which bytes
need to be escaped to avoid having unwanted effects)

ACD Automatic Collision Detection (procedure to detect and avoid IP address
assignment collisions)

ACFC Address and Control Field Compression (in PPP, eliminating the address
and control fields to reduce overhead)

ACK Acknowledgment (an indication that data has arrived at a receiver successfully;
applicable to multiple layers of the protocol stack)

ACL Access Control List (list of filtering rules determining which traffic is
permitted, e.g., through a firewall)

ADSP Author Domain Signing Practices (with DKIM, a policy statement pertaining
to how DKIM is used or deployed within a particular domain)

AEAD Authenticated Encryption with Associated Data (algorithms that perform
encryption and authentication on one portion of their input and authentication
on another portion)

AES Advanced Encryption Standard (current-generation U.S. encryption 
standard) 

AF Assured Forwarding (a PHB offering priority classes and prioritization 
within classes) 

AFTR Address Family Transition Router element (in DS-Lite, a SPNAT used to 
share a small number of IPv4 addresses with multiple customers) 

AH Authentication Header (optional IPsec protocol providing for authentication
of IP traffic, including header information, which is incompatible with
NATs)

AIA Authority Information Access (an X.509 certificate extension indicating 
resources useful in validating a certificate) 

AIAD Additive Increase Additive Decrease (in TCP, methods that moderate
CWND by adding to its value when congestion appears to be low and subtracting
from it when congestion appears to be increasing; not the standard
TCP algorithm)

AIMD Additive Increase Multiplicative Decrease (in TCP, methods that 
moderate CWND by adding to its value when congestion appears to be low 
and multiplying it by a fraction less than one when congestion appears to be 
increasing) 

ALG Application Layer Gateway (an agent, usually software, that converts
protocols at the application layer) 

A-MPDU Aggregated MPDU (frame containing multiple MPDUs, part of IEEE 
802.11n) 

A-MSDU Aggregated MSDU (frame containing multiple MSDUs, part of IEEE 
802.11n) 

ANDSF Access Network Discovery and Selection Function (a portion of MoS 
indicating information about networks that may be used to influence handoff 
and network selection) 

AODV Ad-hoc On-Demand Distance Vector routing protocol (early ad-hoc on- 
demand routing protocol using distance vectors) 

AP Access Point (802.11 STA usually used to interconnect wireless and wired 
network segments) 

API Application Programming Interface (functions invoked by applications to 
obtain effects such as sending and receiving network traffic) 

APIPA Automatic Private IP Addressing (a mechanism whereby a node self- 
configures its own IP address from a particular range; usually applies to IPv4 
nodes) 

APSD Automatic Power Save Delivery (periodic batch processing of 802.11
frames in support of PSM)

AQM Active Queue Management (queue management methods that react to
the traffic dynamics, not including "drop-tail" typical of FCFS/FIFO queue
management)

ARP Address Resolution Protocol (a protocol above the link layer that resolves
IPv4 addresses to MAC layer addresses, uses link layer broadcast addressing)

ARQ Automatic Repeat Request (the retransmission of information; usually
after inferred loss)

AS Authentication Server (with PANA, server where authentication checks are
performed)

AS Autonomous System (a 16- or 32-bit number used in connection with inter-ISP
routing to identify a collection of network prefixes and their owner)

ASM Any-Source Multicast (multicast wherein any party can source traffic)

ASN.1 Abstract Syntax Notation One (an ISO standard defining the abstract
syntax for information but not the corresponding encoding format; BER and
DER are encodings for ASN.1 information)

AUS Application Unique String (input string to the DDDS algorithm)

AUTH Authentication (with IKE, payload containing information required to
perform authentication of the sender)

AXFR Zone Transfer (full exchange of DNS zone information; uses TCP)

B4 Bridging Broadband element (in DS-Lite, a router which encapsulates IPv4
traffic in IPv6 tunnels terminated at an AFTR; a B4 does not perform NAT
functions)

BACP Bandwidth Allocation Control Protocol (with PPP, a protocol for configuring
BoD)

BAP Bandwidth Allocation Protocol (a protocol used to configure links in a
bundle for MPPP)

BCMCS Broadcast and Multicast Service Controller (in cellular networks, manages
multicast)

BER Basic Encoding Rules (an ITU standard encoding syntax; a subset of
ASN.1)

BER Bit Error Rate (number of bit errors expected per number of bits in transit)

BGP Border Gateway Protocol (inter-domain routing protocol with policy
support)

BIND9 Berkeley Internet Name Domain (version 9) (a name server software
implementation popular on UNIX-like systems)

BITS Bump In the Stack (option for implementing IPsec in the host)

BITW Bump In the Wire (option for implementing IPsec in the network)

BL Bulk Leasequery (in DHCP, a request/response protocol to convey current 
lease information) 

BoD Bandwidth on Demand (ability to dynamically adjust available link 
bandwidth) 

BOOTP Bootstrap Protocol (precursor to DHCP; used to configure hosts) 

BPDU Bridge PDU (PDUs used by STP; exchanged by switches and bridges) 

BPSK Binary Phase Shift Keying (modulating binary using two signal phases) 

BSD Berkeley Software Distribution (UC Berkeley's version of UNIX, included 
the first widely used implementation of TCP/IP) 

BSDP Boot Server Discovery Protocol (an extension to DHCP developed by 
Apple to discover a boot image server) 

BSS Basic Service Set (IEEE 802.11 terminology for an access point and associated
stations)

BTNS Better Than Nothing Security (with IPsec, an option for using certificates
without a full PKI but which is vulnerable to MITM attacks)

BU Binding Update (in MIP, establishes the mapping between a MN's CoA and 
HoA) 

CA Certificate Authority (organization responsible for generating and issuing
public/private key pairs and signing and distributing signed public keys and
CRLs)

CALIPSO Common Architecture Label IPv6 Security Option (security labels 
for IP packets; not widely used) 

CBC Cipher Block Chaining (an encryption mode that uses the XOR operation
to chain encrypted blocks together in an effort to resist re-arrangement
attacks)

CBCP Callback Control Protocol (in PPP, establishes a callback number) 

CCA Clear Channel Assessment (802.11 PHY-layer mechanism that detects 
channel usage) 

CCITT Comite Consultatif International Telephonique et Telegraphique (now 
ITU-T) 

CCM Counter mode with CBC Message Authentication Code (an authenticated 
encryption mode combining CTR mode encryption with CBC-MAC) 

CCMP Counter Mode with CBC-MAC Protocol (encryption used with WPA2; 
from IEEE 802.11i; successor to WPA) 

CCP Compression Control Protocol (in PPP, establishes the compression methods
to use)

ccTLD Country Code TLD (a TLD based on the ISO 3166-1 alpha-2 country code list)

CDP CRL Distribution Point (a location where a CA's current CRL may be 
obtained) 

CERT Certificate (with IKE, payload containing a certificate) 

CERT Computer Emergency Response Team (groups that handle computer
security incidents, including the first CERT at Carnegie Mellon University and
the U.S. Government's US-CERT)

CERTREQ Certificate Request (with IKE, payload indicating trust anchor as an
indication of acceptable certificates)

CGA Cryptographically Generated Address (address generated based on a
hash on a public key)

CHAP Challenge-Handshake Authentication Protocol (protocol requiring a
challenge to match a response; vulnerable to MITM attacks)

CIA confidentiality, integrity, and availability (principles of information security;
the "CIA triad")

CIDR Classless Inter-Domain Routing (a move to address the ROAD problem
by removing the IP address class boundaries but requiring an associated
CIDR mask to be used with inter-domain routing)

CMAC Cipher-based Message Authentication Code (a particular way of using
encryption algorithms as a MAC)

CN Correspondent Node (an MN's conversation peer in MIP scenario)

CNAME Canonical Name (DNS RR providing an alias for another domain
name)

CoA Care-of Address (MN's address assigned while visiting non-home
network)

CoS Class of Service (general term referring to differentiated services based on
different classes of traffic; a concept supported by the DiffServ architecture)

CoT Care-of Test (in a RR check, message sent to MN via its CoA resulting in
MN obtaining a portion of a key used to secure BUs)

CoTI Care-of Test Init (in a RR check, triggers receiver to send a CoT message)

CP Configuration Payload (with IKE, extensible structure for conveying configuration
parameters)

CPS Certification Practice Statement (a CA's policy statement about how certificates
are issued or managed)

CRC Cyclic Redundancy Check (mathematical functions used to check for bit
errors)

CRL Certificate Revocation List (a list of invalid certificates issued by a CA)

CS Cipher Suite (in TLS, the choice of cryptographic algorithm suite)

CS Class Selector (in IP, a DSCP value designed to be compatible with the bit 
values associated with the now-deprecated "Type of Service" and "Traffic 
Class" IP header fields) 

CSMA/CA Carrier-Sense Multiple Access/Collision Avoidance (WiFi's MAC 
protocol, which involves sending when a link is idle and backing off if it is 
not) 

CSMA/CD Carrier-Sense Multiple Access/Collision Detection (Ethernet's classic
MAC protocol, which involves sending when a link is idle and backing off
if collisions are detected)

CSPRNG Cryptographically Secure Pseudo-Random Number Generator (a
PRNG suitable for cryptographic use) 

CSRG Computer Systems Research Group (developers of BSD UNIX at UC 
Berkeley) 

CTCP Compound TCP (a "scalable" TCP variant implemented in modern Windows
systems that combines both delay-based and packet-loss based window
adjustments)

CTR Counter (an encryption mode that uses a counter value to impose a 
required order on encrypted blocks while permitting parallel execution of 
encryption or decryption on multiple blocks) 

CTS Clear To Send (message authorizing sender of RTS to send) 

CW Contention Window (range of time an 802.11 station will wait before sending
under DCF)

CWND Congestion Window (in TCP, a limit placed on the sender's window 
size to avoid or reduce congestion) 

CWR Congestion Window Reducing (or Reduced) (in TCP, reduction of the 
sender's usable window size) 

CWV Congestion Window Verification (in TCP, a method to check and update 
the current value of CWND when deemed necessary) 

DAD Duplicate Address Detection (with IPv6 ND and SLAAC, DAD helps 
determine whether a candidate IPv6 address is already in use by sending an 
NS message for the proposed address) 

DCCP Datagram Congestion Control Protocol (a protocol that provides best- 
effort datagram service to applications and also controls congestion) 

DCF Distributed Coordination Function (CSMA/CA MAC for 802.11 networks) 

DDDS Dynamic Delegation Discovery System (methods to support lazy binding
of strings to data; usually used with DNS for discovery of servers for various
application protocols)



Glossary of Acronyms 939 


DDoS Distributed DoS (a network-based attack often launched by botnets) 

DER Distinguished Encoding Rules (an ITU standard encoding syntax; a subset
of BER for ASN.1 that requires a unique representation to be used for each
value)

DES Data Encryption Standard (an older U.S. standard for symmetric data
encryption using 56-bit keys)

DF Don't Fragment (an IPv4 header bit indicating no fragmentation should be
performed; important for PMTUD)

DH Diffie-Hellman (mathematical protocol to establish a secret value between
two parties even in the presence of an eavesdropper)

DHCP Dynamic Host Configuration Protocol (evolved from BOOTP; sets up
systems with configuration information such as leased IP addresses, default
router, and DNS server IP address)

DIFS DCF Inter-Frame Space (time between frames under 802.11 DCF)

DIX Digital, Intel, Xerox (creators and name of early Ethernet standard)

DKIM DomainKeys Identified Mail (a protocol for cryptographically binding
the sending domain of e-mail with the associated originating mail servers)

DLNA Digital Living Network Alliance (an industry group focused on
interoperability and protocols for consumer media devices such as TVs, DVD
players, DVRs, etc.)

DMZ De-Militarized Zone (a network segment outside an organization's inside
firewall, usually used for hosts providing services to customers or the public)

DNA Detecting Network Attachment (procedures to detect a change in connection
state)

DNAME Non-Terminal Name Redirection (DNS RR supporting generation of
multiple CNAME records using a DNS subtree aliasing mechanism)

DNS Domain Name System (maps names to IP addresses and more)

DNS64 DNS IPv4/IPv6 translation (a mechanism for IPv4/IPv6 coexistence to
translate IPv4 DNS information for IPv6 DNS use)

DNSKEY Key for DNS (DNS RR used with DNSSEC to hold a public key)

DNSSEC DNS Security (origin authentication and integrity assurance for
DNS data)

DNSSL DNS Search List (used with RAs, indicates list of default domain
extensions)

DOI Digital Object Identifier (a method for naming content objects and associating
them with information records)

DoS Denial of Service (a type of resource exhaustion attack)

DPD Delegated Path Discovery (method for delegating the collection of all
information required to validate a certificate path) 

DPV Delegated Path Validation (method for delegating the entire validation 
procedure for a certificate) 

DS Delegation Signer (in DNS, an RR used with DNSSEC to secure a 
delegation) 

DS Differentiated Services (in IP traffic management, methods to provide performance
differentiation for traffic delivery)

DS Distribution Service (in 802.11 LANs, the network or service used to interconnect
APs, which is most often a wired 802.3/Ethernet network)

DSA Digital Signature Algorithm (an algorithm for generating digital signatures
based on the discrete logarithm problem)

DSACK Duplicate SACK (in TCP, a SACK variant that includes description of 
received duplicated segments) 

DSCP DS Code Point (field value in packet indicating a particular forwarding 
behavior is desired) 

DSL Digital Subscriber Line (dedicated broadband data link over POTS line) 

DS-Lite Dual Stack Lite (a framework for IPv6-based service providers to 
provide access to dual stack or single stack clients using a combination of 
IPv4-in-IPv6 tunneling and NAT) 

DSRK Domain-Specific Root Key (key derived from an EMSK intended for use 
by systems under a single administrative authority) 

DSS Digital Signature Standard (a U.S. standard for digital signatures based 
on DSA) 

DSUSRK Domain-Specific USRK (a key combining the usage policies of a 
USRK and DSRK) 

DTLS Datagram TLS (variant of TLS used with datagram protocols such as 
UDP) 

DUID DHCP Unique Identifier (value placed in DHCP request to match 
responses) 

DUP Duplicate (used in multiple contexts—e.g., DUP ACKs) 

EAP Extensible Authentication Protocol (framework supporting various 
authentication methods) 

EAP-FAST EAP-Flexible Authentication via Secure Tunneling (Cisco's EAP 
method using TLS that replaces its earlier LEAP EAP method) 

EAPOL EAP over LAN (e.g., EAP over Ethernet as used in IEEE 802.1X) 

EAP-TTLS EAP-Tunneled Transport Layer Security (an EAP method based on 
earlier TLS EAP method, but requires only server side to obtain certificate) 

EC2N Elliptic Curve groups modulo a power of 2 (groups based on elliptic 
curves, in the abstract algebra sense, over the Galois Field GF(2^N)) 

ECC Error Correcting Code (redundant bits added to information bits usable to 
correct errors) 

ECDSA Elliptic Curve Digital Signature Algorithm (a variant of DSA using 
ECC) 

ECE ECN Echo (in TCP with ECN, the reflection of ECN information to a TCP 
sender) 

ECN Explicit Congestion Notification (direct method of indicating 
congestion—e.g., by routers to hosts) 

ECP Elliptic Curve groups modulo a Prime (groups based on elliptic curves, in 
the abstract algebra sense, over the Galois Field GF(P) for a prime P) 

ECT ECN-Capable Transport (a transport protocol capable of interpreting ECN 
indicators) 

EDCA Enhanced Distributed Channel Access (802.11 coordinating function 
supporting QoS, from 802.11e) 

EDNS0 Extension mechanisms for DNS (version 0) (a method to extend DNS 
RRs, version 0, needed by DNSSEC) 

EF Expedited Forwarding (a PHB offering a service class as if no congestion 
were present, generally implying it is the highest priority and requiring 
admission control to avoid oversubscription) 

EFO Expanded Flags Option (used with DHCP, indicates presence of additional 
options) 

EIFS Extended IFS (extended IFS used when receiving unrecognized frame 
under 802.11 DCF) 

EMSK Extended MSK (a secondary key generated in addition to the MSK by 
EAP after key derivation) 

ENUM E.164 to URI DDDS Application (a particular DDDS used to map E.164 
telephony-style addresses to URIs) 

EP Enforcement Point (with PANA, point where access control policies are 
enforced) 

EQM Equal Modulation (using the same modulation scheme on different data 
streams simultaneously) 

ERE Eligible Rate Estimate (part of TCP Westwood+; estimate of the amount of 
bandwidth that could be used by a connection) 

ERP EAP Re-authentication Protocol (an EAP extension to reduce the latency 
when re-establishing authentication) 

ESN Extended Sequence Number (in IPsec, an extended sequence number of 64 
bits used to combat replay attacks; normal sequence numbers are 32 bits) 

ESP Encapsulating Security Payload (required IPsec protocol providing for 
authentication and/or confidentiality of traffic) 

ESSID Extended Service Set Identifier (IEEE 802.11 network name) 

EUI Extended Unique Identifier (MAC-layer address prefix format defined by 
IEEE, extended from OUI) 

EV Extended Validation (a form of certificate with enhanced identity validation 
performed prior to issuance) 

EV-DO Evolution, Data Optimized (or Only) (3GPP2 wireless broadband 
standard; an evolution of CDMA2000) 

FACK Forward Acknowledgment (in TCP, one more than the highest sequence 
number known to have reached the receiver; determined using SACK) 

FCFS First Come, First Served (scheduling discipline with in-order service; no 
priority) 

FCS Frame Check Sequence (general term for bits used to check for bit errors) 

FEC Forward Error Correction (using redundant bits to correct errors in data bits) 

FIFO First In, First Out (queue management discipline with in-order service; 
no re-arrangements) 

FIN Finish (a TCP header bit and last segment type sent on a TCP connection) 

FMIP Mobile IP with Fast Handovers (modification to MIPv6 with early 
handovers) 

FQDN Fully Qualified Domain Name (a domain name with full domain 
extension included) 

F-RTO Forward RTO (in TCP, a method to infer whether a retransmission was 
spurious and if so facilitate the avoidance of unnecessary retransmissions) 

FTP File Transfer Protocol (a TCP-based file transfer protocol using separate 
control and data connections) 

GCKS Group Controller/Key Server (in IPsec, used with GKM; holds and 
issues keys for GSAs) 

GCM Galois/Counter Mode (an authenticated encryption mode combining 
CTR mode encryption with Galois mode authentication) 

GDOI Group Domain of Interpretation (in IPsec, a group key management 
protocol based on ISAKMP and IKE) 

GENA General Event Notification Architecture (an XML-based notification 
framework using HTTP over multicast UDP; used with UPnP) 

GI Guard Interval (in communications engineering, minimum time between 
transmissions used to avoid inter-symbol interference) 

GKM Group Key Management (in IPsec, methods to distribute key material to 
a group in order to support group SA formation) 

GMAC Galois Message Authentication Code (an authentication-only variant of 
GCM) 

GMI Group Membership Interval (in IGMP and MLD, the amount of time a 
multicast router waits before deciding there is no particular source or no more 
group members; set to QRV * QI + QRI) 

GMRP Generic Multicast Registration Protocol (replaced by MMRP) 

GPAD Group PAD (with IPsec, abstraction of a database containing 
authentication data for all GCKS entities) 

GRE Generic Routing Encapsulation (generic encapsulation within IP 
datagrams) 

GSA Group Security Association (in IPsec, an SA established among group 
members using a multicast protocol) 

GSAKMP Group Secure Association Key Management Protocol (a framework 
for creating groups with common cryptographic information, distributing 
policy, performing access control, generating group keys, and recovering from 
group dynamic changes) 

GSPD Group SPD (in IPsec, an SPD capable of holding information for both 
SAs and GSAs) 

GSS-API Generic Security Services API (an API to access myriad security 
services such as authentication, confidentiality, etc.; typically used with the 
Kerberos authentication system) 

gTLD Generic TLD (a TLD—such as COM, EDU, MIL—not based on 
country code) 

GVRP GARP VLAN Registration Protocol (replaced by MVRP) 

HA Home Agent (system offering MIP helper service to an MN) 

HAIO Home Agent Information Option (in ICMPv6, an option supporting 
MIPv6 to indicate address of an HA) 

HCF Hybrid Coordination Function (coordinating function supporting both 
priority and contention-based 802.11 channel access) 

HDLC High-level Data Link Control (a popular ISO standard data link 
protocol, the basis for the most popular variant of PPP) 

HELD HTTP-Enabled Location Delivery (a protocol for delivering LCI using 
HTTP/TCP/IP) 

HIP Host Identity Protocol (a research protocol architecture focusing on 
mobility and security) 

HMAC Hash-based Message Authentication Code (a particular way of using 
hashing algorithms as a MAC) 

HoA Home Address (in MIP, an MN's address from its home network) 

HOPOPT IPv6 Hop-by-Hop Option (an IPv6 option type applicable to each 
hop in a path) 

HoT Home Test (in an RR check, message sent to MN via HA resulting in MN 
obtaining a portion of a key used to secure BUs) 

HoTI Home Test Init (in an RR check, triggers receiver to send a HoT message) 

HSPA High-Speed Packet Access (3GPP wireless broadband standard; an 
evolution of WCDMA) 

HSTCP Highspeed TCP (a "scalable" TCP variant in which CWND is adjusted 
based in part on its current value; designed to operate more effectively in high 
capacity environments) 

HT High Throughput (higher speeds associated with the IEEE 802.11n 
standard) 

HTML Hyper-Text Markup Language (the basic language of the WWW) 

HTTP Hyper-Text Transfer Protocol (primary protocol of the WWW; often 
carries HTML) 

HTTPMU HTTP using UDP (a method for carrying HTTP traffic on UDP 
using multicast addressing; used to carry SSDP messages in UPnP) 

HTTPS HTTP over SSL/TLS (standard for secure WWW exchange) 

HWRP Hybrid Wireless Routing Protocol (routing protocol proposed for IEEE 
802.11s) 

IA Identity Association (in DHCP, a collection of addresses) 

IAB Internet Architecture Board (one of IETF's governing bodies; responsible 
for architectural oversight and appointment of liaisons to other SDOs) 

IAID IA Identifier (in DHCP, an ID referring to a particular IA) 

IANA Internet Assigned Numbers Authority (maintains protocol numbers and 
field values) 

IBSS Independent Basic Service Set (802.11 ad-hoc network) 

ICANN Internet Corporation for Assigned Names and Numbers (non-profit 
governing body for domain names and related policy) 

ICE Interactive Connectivity Establishment (a framework for performing NAT 
traversal, which entails trying direct connections, STUN, and finally TURN to 
enable communication in the presence of NATs) 

ICMP Internet Control Message Protocol (an information and error reporting 
protocol considered part of IP) 

ICS Internet Connection Sharing (alternative name for NAT; used with 
Microsoft Windows) 

ICV Integrity Check Value (a value used to check the integrity of a message— 
e.g., cryptographic hash) 

ID Identification (in IKE, payload indicating identity of sender) 

IDN Internationalized Domain Name (domain name encoding non-ASCII 
characters) 

IEEE Institute of Electrical and Electronics Engineers (SDO for link-layer 
protocols and more) 

IESG Internet Engineering Steering Group (IETF's governing body with RFC 
approval authority) 

IETF Internet Engineering Task Force (SDO for Internet standards) 

IGD, IGDDC Internet Gateway Device/Discovery and Control (a UPnP 
protocol for discovering and configuring gateway devices such as home NATs) 

IGMP Internet Group Management Protocol (a protocol to manage IPv4 multicast 
groups; used by routers and end hosts) 

IHL Internet Header Length (IPv4 header field indicating the header length in 
32-bit words) 

IID Interface Identifier (numeric identifier usually based on MAC address; 
used when choosing IPv6 addresses, but not used for this purpose when 
privacy extensions are enabled) 

IKE Internet Key Exchange (part of IPsec; a protocol to dynamically establish 
security associations including keys and operating parameters) 

IMAP Internet Message Access Protocol (used to retrieve e-mail headers and 
messages from servers) 

IMAPS IMAP over SSL/TLS (a secure protocol for fetching e-mail, supported 
by most e-mail programs) 

IN Internet (in DNS, the class name indicating Internet information) 

IND Inverse Neighbor Discovery (provides RARP-like function for IPv6) 

IP Internet Protocol (standard best-effort Internet packet protocol 
implementing a common abstract datagram on any link layer network) 

IPCP IP Control Protocol (in PPP, an NCP used to configure an IPv4 network 
link) 

IPG Inter-Packet Gap (minimum spacing between frames in a MAC protocol) 

IPsec IP Security (a framework for securing IP traffic, including the IKE, AH, 
and ESP protocols) 

IPV6CP IPv6 Control Protocol (in PPP, an NCP used to configure an IPv6 
network link) 

IRIS Internet Registry Information Service (database containing information 
relating address ranges, associated AS numbers, contact information, and 
name servers) 

IRTF Internet Research Task Force (research groups affiliated with IETF via the 
IAB) 

ISAKMP Internet Security Association and Key Management Protocol (in 
IPsec, SA establishment protocol pre-dating IKE) 

ISATAP Intra-Site Automatic Tunnel Addressing Protocol (an automatic 
IPv6-to-IPv4 tunneling technology supported by Microsoft) 

ISDN Integrated Services Digital Network (combination circuit/packet 
switched data service) 

IS-IS Intermediate System to Intermediate System (ISO link-state routing 
protocol) 

ISL Cisco's Inter-Switch Link (Cisco's protocol for maintaining VLAN 
information among switches) 

ISM Industrial, Scientific, and Medical (licence-free frequency bands in much 
of the world, used by Wi-Fi) 

ISN Initial Sequence Number (in TCP, the first sequence number for a 
connection; assigned to the SYN) 

ISO International Organization for Standardization (SDO responsible for 
defining various protocols and encodings once considered for replacing 
TCP/IP) 

ISOC Internet Society (Internet standards leadership nonprofit corporation) 

ISP Internet Service Provider (an entity, often a business, that allocates 
addresses, provides DNS and routing, and works with other ISPs) 

ITU International Telecommunications Union (SDO for radio and telephony 
standards) 

ITU-T ITU Telecommunication Standardization Sector (formerly CCITT; one 
of the three "sectors" of ITU responsible for standards or "recommendations" 
such as ASN.1, X.25, DSL) 

IW Initial Window (in TCP, the initial value of CWND) 

IXFR Incremental Zone Transfer (incremental exchange of DNS zone 
information, uses TCP) 

KE Key Exchange (with IKE, payload used for establishing keys; generally uses 
DH) 

KSK Key Signing Key (a key used with DNSSEC for signing other keys; 
typically has the SEP bit set) 

L2TP Layer 2 Tunneling Protocol (IETF standard link layer tunneling protocol) 

LACP Link Aggregation Control Protocol (part of IEEE 802.1AX for managing 
link aggregates) 

LAG Link Aggregation Group (set of links acting together as one virtual 
higher-performance link) 

LAN Local Area Network (a network within a small geographic area such as a 
single site, office, or home) 

LCG Linear Congruential Generator (a deterministic type of popular PRNG, 
which is not a CSPRNG) 

LCI Location Configuration Information (data representing the 
location—geographical or civic—of a system) 

LCI Logical Channel Identifier (in circuit switching, identifier for a virtual 
channel) 

LCN Logical Channel Number (in circuit switching, number of a virtual 
channel) 

LCP Link Control Protocol (in PPP, used to establish a link) 

LDAP Lightweight Directory Access Protocol (a lookup protocol based on the 
ISO X.500 DAP protocol) 

LDRA Lightweight DHCP Relay Agent (mechanisms to allow layer 2 devices to 
act as DHCP relay agents) 

LEAP Lightweight Extensible Authentication Protocol (Cisco's EAP method 
using WEP or TKIP keys; now known to have vulnerabilities) 

LLA Link Layer Address (in PMIPv6, a mobility header option to indicate link 
layer address) 

LLC Logical Link Control (sublayer of the MAC layer related to link control) 

LLMNR Link Local Multicast Name Resolution (a multicast variant of DNS 
designed for on-link use and that runs on a different port number than DNS; 
used for local service and node discovery) 

LMQI Last Member Query Interval (in IGMP and MLD, the time between 
group-specific query messages) 

LMQT Last Member Query Time (in IGMP and MLD, the total time spent after 
sending a last member query and possible transmissions; represents the 
"leave latency") 

LNP Local Network Protection (a collection of techniques suggested for use in 
IPv6 deployments making NATs unnecessary) 

LoST Location-to-Service Translation (a framework for offering services based 
on location—e.g., indication of the nearest hospital) 

LQR Link Quality Reports (in PPP, reports of link quality measurements 
including number of packets received, sent, and rejected due to errors) 

LTE Long-Term Evolution (3GPP wireless broadband standard; an evolution of 
HSPA) 

LW-MLD Lightweight MLD (variant of MLD with simpler join/leave 
semantics) 

MAC Media Access Control (controls for mediating access to a shared network 
medium, usually a portion of the link layer protocol) 

MAC Message Authentication Code (a mathematical function used to help 
verify the integrity of a message) 

MAN Metropolitan Area Network (a network spanning a modest geographical 
extent, such as a city or region) 

MCS Modulation and Coding Scheme (combination of modulation and coding, 
many combinations are available in 802.11n) 

MD Message Digest Algorithms (mathematical functions giving a short 
numeric "fingerprint" for a larger message) 

mDNS Multicast DNS (local variant of name service developed by Apple) 

MIH Media-Independent Handoff (mechanisms to support change of network 
attachment point between heterogeneous networks; the IEEE 802.21 standard 
covers MIH for 802.3, 802.11, 802.15, 802.16, 3GPP, and 3GPP2 network types) 

MII Media-Independent Interface (in hardware, the interface between 
the MAC implementation and PHY protocol implementation, which is 
PHY-independent) 

MIME Multipurpose Internet Mail Extensions (method for labeling and 
encoding various object types in electronic mail) 

MIMO Multiple Input, Multiple Output (wireless antenna scheme with 
multiple antennas offering performance superior to single-antenna systems but 
requiring more sophisticated signal processing) 

MIP Mobile IP (IP addressing and routing extensions to support movement of 
network attachment point without address change) 

MITM Man-in-the-Middle attack (the typical form of an MSM attack, carried 
out by an interposer) 

MLD Multicast Listener Discovery (used by IPv6 routers to discover multicast 
receivers on a link; provides capabilities similar to those of IGMP for IPv4) 

MLPP Multilevel Precedence and Preemption (telephone scheme to prioritize 
calls—e.g., for military use) 

MMRP Multiple MAC Registration Protocol (part of MRP used for registering 
multicast interest) 

MN Mobile Node (the moving node in a MIP scenario) 

MOBIKE Mobile version of IKE (enhancements to IKE to support mobility and 
change of addressing information) 

MODP Modulo-P groups (groups based on modular arithmetic, in the abstract 
algebraic sense, used with key establishment protocols) 

MoS Mobility Services (portion of the IEEE 802.21 standard supporting media- 
independent handoff services) 

MP Mesh Point (name of a node in IEEE 802.11s operating in a mesh 
configuration) 

MP, MPPP, MLP, MLPPP Multi-link PPP (using PPP over multiple links 
simultaneously) 

MPDU MAC Protocol Data Unit (name of the frame used in 802.11 standards) 

MPE Manchester Phase Encoding (bit encoding scheme where a voltage 
transition indicates one bit) 

MPLS Multi-Protocol Label Switching (architecture that switches frames based 
on tag values, not IP addresses) 

MPPC Microsoft's Point-to-Point Compression (used with PPP) 

MPPE Microsoft's Point-to-Point Encryption (used with PPP) 

MPV Maximum Pad Value (in PPP, maximum number of pad bytes) 

MRD Multicast Router Discovery (protocol to discover on-link multicast router 
neighbors) 

MRP Multiple Registration Protocol (IEEE 802.1ak standard for registering 
attributes) 

MRRU Multilink Maximum Received Reconstructed Unit (MRU after 
reconstruction from parts on multiple MP links) 

MRU Maximum Receive Unit (largest packet/message size a receiver will 
accept) 

MS-CHAP Microsoft's Challenge-Handshake Authentication Protocol (an 
authentication protocol involving a request/reply and validated response, 
with two versions: MS-CHAPv1 and MS-CHAPv2) 

MSDU MAC Services Data Unit (802.11 frame type available to layers above 
MAC) 

MSK Master Session Key (a key derived after an EAP session using methods 
supporting key derivation) 

MSL Maximum Segment Lifetime (in TCP, the maximum time a segment can 
exist in the network before being determined invalid) 

MSM Message Stream Modification (active modification of messages; usually a 
type of attack) 

MSS Maximum Segment Size (in TCP, the largest segment a receiver is willing 
to receive; usually provided in an option during connection establishment) 

MTU Maximum Transmission Unit (maximum frame size a network will 
transport) 

MVRP Multiple VLAN Registration Protocol (part of MRP used for registering 
VLANs) 

MX Mail Exchanger (DNS RR indicating a priority order of hosts willing to use 
SMTP to exchange mail) 

NAC Network Access Control (process employed to determine whether a 
device should receive access rights to use a network) 

NACK Negative Acknowledgment (an indication of non-receipt or 
non-acceptance) 

NAP Network Access Protection (Microsoft's variant of NAC; first available 
with Windows Server 2008) 

NAPT NAT with Port Translation (NAT with port re-writing, the most 
common form of NAT) 

NAPTR Name Authority Pointer (DNS RR used with a DNS-based DDDS for 
holding re-writing rules) 

NAR New Access Router (in FMIPv6, router that is expected to be used soon) 

NAT Network Address Translation (mechanism to re-write addresses in 
IP datagrams; used primarily to reduce the usage of globally routable IP 
addresses; usually used in conjunction with private IP addresses; also 
supports a type of firewall capability) 

NAT64 IPv6/IPv4 NAT (a NAT that translates between IPv4/ICMPv4 and 
IPv6/ICMPv6 and vice versa; proposed for IPv6/IPv4 interoperability and 
coexistence) 

NAT-PMP NAT Port Mapping Protocol (an alternative to IGD developed by 
Apple for configuring some NAT devices; provides the ability to remotely set 
up port forwarding) 

NAT-PT NAT with Protocol Translation (now-deprecated approach to IPv4/ 
IPv6 translation) 

NAV Network Allocation Vector (time delay before sending due to other 
stations' channel use in 802.11 DCF) 

NBMA Non-Broadcast Multiple Access (multi-user networks lacking 
broadcasting capability) 

NCoA New Care-of Address (in FMIPv6, CoA to be obtained from NAR) 

NCP Network Control Protocol (in PPP, used to establish the network-layer 
protocol) 

ND, NDP Neighbor Discovery (IPv6 method to discover and obtain MAC 
address of on-link neighbors; works like ARP; implemented as part of 
ICMPv6) 

NEMO Network Mobility (mobility where a router and network change 
attachment point) 

NIC Network Interface Card (the device interfacing a computer with a 
network) 

NONCE number used once (a random value used in many cryptographic 
protocols to combat replay attacks) 

NPT66 IPv6-to-IPv6 NAPT (NAT with algorithmic address and port 
translation) 

NRO Number Resource Organization (the Address Supporting Organization 
to ICANN) 

NS Name Server (DNS RR carrying the name of another name server) 

NS Neighbor Solicitation (part of IPv6 ND; similar to an IPv4 ARP request but 
uses IPv6 multicast addressing; implemented using ICMPv6) 

NSCD Name Services Cache Daemon (process to provide caching for DNS and 
other resolutions popular on UNIX systems) 

NSEC Next Secure (DNS RR used with DNSSEC to indicate the next RR in an 
ordered list; used for authenticated denial of existence) 

NSEC3 Next Secure (version 3) (DNS RR like NSEC but including hash 
function to resist DNS name enumeration attacks) 

NSEC3PARAM NSEC3 Parameters (DNS RR used with DNSSEC holding 
NSEC3 hash function parameters) 

NTN Non-Terminal NAPTR (in DNS, a NAPTR pointing to another domain 
with records) 

NTP Network Time Protocol (a protocol for synchronizing clocks) 

NUD Neighbor Unreachability Detection (in IPv6 ND, to determine if a 
neighbor can still be reached) 

OCSP Online Certificate Status Protocol (a protocol for checking the validity of 
a certificate; an alternative to obtaining a CRL) 

OFDM Orthogonal Frequency Division Multiplexing (a sophisticated 
modulation scheme in which subcarriers of multiple frequencies are simultaneously 
modulated in a specified bandwidth to achieve high throughput; used by DSL, 
802.11a/g/n, 802.16e, and advanced cellular data standards including LTE) 

OID Object Identifier (numeric identifier of a digital object; used in certificate 
encodings) 

OLSR Optimized Link State Routing (a standard protocol for on-demand 
routing in ad-hoc networks) 

OOB Out Of Band (information delivered outside a primary communication 
channel) 

ORO Option Request Option (in DHCP, an option indicating a system's interest 
in knowing which options are supported) 

OSI Open System Interconnect (an abstract reference model specified by ISO 
for open systems that helped form the basis of layered design in protocols) 

OUI Organizationally Unique Identifier (original MAC-layer address prefix 
format defined by IEEE) 

P2P Peer-to-Peer (participating systems are both clients and servers) 

PA Provider-Aggregatable (IP address space where a customer's prefix is given 
by their provider) 

PAA PANA Authentication Agent (PANA agent performing authentication, 
such as an AAA server) 

PaC PANA Client (PANA agent requesting authentication) 

PAD Peer Authentication Database (with IPsec, abstraction of database 
containing authentication information for each peer such as use of IKE or PSK and 
associated authentication data) 

PANA Protocol for Carrying Authentication for Network Access (UDP/IP 
carrier for EAP) 

PAP Password Authentication Protocol (protocol that carries cleartext 
password; vulnerable to MITM or eavesdroppers) 

PAWS Protection Against Wrapped Sequence Numbers (in TCP, method using 
TSOPT values to notice sequence number wrapping) 

PCF Point Coordinating Function (combined contention-free and 
contention-based MAC protocol for 802.11; not widely used) 

PCO Phased Coexistence Operation (method for an 802.11 AP to switch 
channel widths for less negative impact on legacy equipment) 

PCoA Previous Care-of Address (in FMIPv6, current or previous CoA obtained 
from PAR) 

PCP Port Control Protocol (current-generation draft IETF protocol for 
configuring NATs including SPNATs and NAT64) 

PDU Protocol Data Unit (describes a message at some protocol layer; 
sometimes used interchangeably and informally with packet, frame, datagram, 
segment, or message) 

PEAP Protected Extensible Authentication Protocol (a popular method to 
encapsulate EAP in TLS; similar to EAP-TTLS) 

PEN Private Enterprise Number (numbers assigned by IANA usable by an 
enterprise in forming OIDs) 

PFC Protocol Field Compression (in PPP, eliminating the Protocol field to reduce 
overhead) 

PFS Perfect Forward Secrecy (in public key cryptography, the property by 
which compromise of one key leads at most to the compromise of data 
encrypted with that key and not other data or keys) 

PHB Per-Hop Behavior (abstract behavior at router used to implement DS) 

PHY Physical (a layer in the OSI model; usually describes connectors, frequencies, 
coding, and modulation) 

PI Provider-Independent (IP address space owned by a customer; not derived 
from an ISP's address prefix) 

PIM Protocol Independent Multicast (non-local multicast routing protocol that 
can leverage unicast routing protocols' data and operations) 

PIO Prefix Information Option (in ICMPv6, an option carrying an IP address 
prefix) 

PKC Public Key Certificate (a digital object including a public key and 
signature from a CA, along with various usage policies and parameters) 

PKCS Public Key Cryptography Standards (methods to encode and represent 
public key and related material) 

PKI Public Key Infrastructure (system for managing and distributing public 
keys) 

PLCP Physical Layer Convergence Procedure (802.11 method for encoding and 
determining frame type and radio parameters) 

PMTU Path MTU (minimum MTU across links on the path from sender to 
receiver) 

PMTUD PMTU Discovery (process of determining the PMTU; usually 
depends on ICMP PTB messages) 

PNAC Port-Based NAC (a version of NAC wherein the physical port of 
attachment is used in making an authorization decision) 

PoE Power over Ethernet (carries device power over Ethernet wiring) 

POTS Plain Old Telephone Service (conventional analog telephone service) 

PPP Point-to-Point Protocol (a link-layer configuration and data 
encapsulating protocol capable of carrying multiple network layer protocols and using 
multiple underlying physical links) 

PPPoE PPP over Ethernet (methods to establish a PPP association over an 
Ethernet link) 

PPTP Point-to-Point Tunneling Protocol (Microsoft's link layer tunneling 
protocol) 

PRF Pseudorandom Function Family (a set of functions that cannot be 
distinguished from truly random functions using a polynomial-time algorithm; also 
sometimes used less formally to refer to a single such function) 

PRNG, PRG Pseudo-Random Generator (a mathematical function used to 
compute a series of random-appearing values) 

PSK Pre-Shared Key (pre-placing encryption keys; no dynamic key exchange 
protocol used) 

PSM Power Save Mode (a mode of 802.11 where devices may "sleep" when not 
busy and poll to receive their information from an AP at a later time) 

PSMP Power-Save Multi-Poll (bi-directional version of APSD, part of 802.11n) 

PTB Packet Too Big (an ICMP Destination Unreachable Fragmentation Required 
or IPv6 Packet Too Big message indicating a packet is too large for the 
next-hop MTU size) 

QAM Quadrature Amplitude Modulation (combination of phase and 
amplitude modulation) 

QBSS QoS BSS (an 802.11 BSS enhanced with 802.11e or 802.11n QoS features) 

QI Query Interval (in IGMP and MLD, time between general queries) 

QoS Quality of Service (general term describing how traffic can be handled 
differently, usually with better or worse latency or drop precedence, based on 
configuration parameters) 

QPSK Quadrature Phase Shift Keying (modulating two bits per symbol, 
typically using four signal phases, although more advanced versions with 
more bits per symbol are possible) 

QQI Querier's Query Interval (in IGMP and MLD, time between sending 
general query messages; current non-querier multicast routers adopt the most 
recently received QQI value as their QI value) 

QQIC Querier's Query Interval Code (in IGMP and MLD messages, encoding 
of the QQI value) 

QRI Query Response Interval (in IGMP and MLD, the maximum amount of 
time a receiver is permitted to send a response to a query) 

QRV Querier Robustness Variable (in IGMP and MLD, sets number of 
retransmissions) 

QS Quick Start (in TCP, an experimental modification for faster startup 
behavior provided devices on the path agree) 

QSTA QoS STA (an 802.11 STA supporting QoS capabilities) 

RA Router Advertisement (message indicating presence of an on-link router
neighbor; uses ICMP)

RADIUS Remote Authentication Dial-In User Service protocol (a popular
protocol for carrying AAA data)

RAIO Relay Agent Information Option (in DHCPv6, an option used by relays
to insert various bits of information)

RARP Reverse ARP (protocol providing network layer to MAC layer address
mappings)

RAS Remote Access Server (a server that handles remote users: authentication,
access control, etc.)

RC4 Rivest Cipher #4 (a popular symmetric key encryption scheme designed
by Ron Rivest)

RD Router Discovery (procedure to locate a proximal router; uses ICMP)

RDATA Returned Data (part of the DNS protocol used to hold returned data)

RDNSS Recursive DNS Server (used in RAs; indicates address of DNS server)

RED Random Early Detection (an AQM scheme that marks or drops packets
with increasing probability when persistent congestion appears to be
growing)

RFC Request for Comments (documents published by IETF; some are
standards)

RGMP Router-port Group Management Protocol (Cisco's protocol to enable
IGMP snooping)

RH Routing Header (an IPv6 extension header that alters traffic delivery path)

RHBP Rate Halving with Bounded Pacing (in TCP, an evolved version of the
FACK algorithm to help spread retransmissions more evenly across an RTT
period after inferred packet loss)

RIP Routing Information Protocol (small organization routing protocol; the
original version does not support subnet masks)

RIR Regional Internet Registry (allocates address space for some region of the
world)

RO Route Optimization (improving routes from indirect "dogleg" paths used
in simple MIP)

ROAD Running Out of Address Space (a problem motivating the creation of
IPv6 and resulting in the creation of CIDR)

ROHC Robust Header Compression (current-generation standards for protocol
header compression)

RP Rendezvous Point (used with multicast routing to exchange group
information)

RPC Remote Procedure Call (a framework supporting a program's procedure
calls to be handled remotely)

RPF Reverse Path Forwarding (to avoid loops, an RPF check is performed by
multicast routers to ensure a multicast datagram arrives on the same interface
used to reach the sender)

RPSL Routing Policy Specification Language (a language used to express
routing policies such as which network prefix corresponds to which owning AS)

RR Resource Record (a typed information block owned by a domain name and 
distributed via DNS) 

RRP, RR Return Routability/Procedure (a check used with MIPv6 to ensure a 
mobile node is authentic, and includes a HoA check and CoA check) 

RRset Resource Record Set (a collection of DNS RRs with same domain name 
owner and class) 

RRSIG Resource Record Signature (DNS RR used with DNSSEC holding a 
signature on an RRset) 

RS Router Solicitation (an ICMP message that induces a router to produce a 
response) 

RSA Rivest, Shamir, Adleman (the most popular public key cryptography
algorithm)

RSN Robust Security Network (improved security in IEEE 802.11i/WPA; 
included in 802.11 standard) 

RSNA RSN Association (full use/implementation of RSN) 

RST Reset (a TCP header bit and segment type that causes a TCP connection 
abort) 

RSTP Rapid Spanning Tree Protocol (decreased latency version of STP) 

RTO Retransmission Timeout (time before retransmitting data thought to be 
lost) 

RTS Request To Send (message indicating desire to send a subsequent 
message) 

RTT Round Trip Time (minimum time to expect a response from a
communication peer)

RTTM RTT Measurement (an instantaneous estimate of the RTT) 

RTTVAR RTT Variance (in TCP, time-averaged estimate of a connection's RTT 
deviation) 

RTX Retransmission (re-sending of data) 

RW Restart Window (in TCP, CWND value when TCP restarts sending after an 
idle period) 

SA Security Association (in IPsec, state pertaining to a unidirectional
association between peers; includes agreed-upon keys, algorithms, etc.; an SA
can be unicast or multicast)

SACK Selective Acknowledgment (in TCP, an option indicating correctly
received out-of-sequence data)

SAD Security Association Database (in IPsec, abstraction of database containing
information on each active SA; logically indexed by SPI)

SAE Simultaneous Authentication of Equals (form of authentication used with
802.11s)

SAP Session Announcement Protocol (carries experimental multicast session
announcements; see also SDP)

SCSV Signaling Cipher Suite Value (in TLS, a CS value that indicates not a CS
but a particular set of alternative functions or options)

SCTP Stream Control Transmission Protocol (a reliable transport protocol
alternative to TCP that does not enforce strict ordering and supports multiple
substreams and endpoint address changes)

SCVP Server-Based Certificate Validation Protocol (a protocol supporting
DPD and DPV for certificates)

SDID Signing Domain Identifier (with DKIM, name for the domain of the
signer)

SDLC Synchronous Data Link Control (a precursor to HDLC, the link layer of
SNA)

SDO Standards-Defining Organization (including IEEE, IETF, ISO, ITU, 3GPP,
3GPP2)

SDP Session Description Protocol (a protocol that describes multimedia
sessions)

SEND Secure Neighbor Discovery (a secure variant of ND using CGAs)

SEP Secure Entry Point (in DNSSEC, indicates a DNSKEY RR contains a KSK)

SFD Start Frame Delimiter (bit pattern indicating the starting portion of frame
in a link PDU)

SG Security Gateway (with IPsec, system terminating IPsec protocols, often at
network edge)

SHA Secure Hash Algorithm (one of a set of hashing algorithms suitable for
ensuring message integrity)

SIFS Short Inter-Frame Space (smallest amount of time between an 802.11
frame and its ACK)

SIIT Stateless IP/ICMP Translation (a framework for translation between IPv4
and IPv6, including special rules for ICMP translation, NAT64, and DNS64)

SIP Session Initiation Protocol (general signaling protocol; used with VoIP)

SLAAC Stateless Address Autoconfiguration (a mechanism whereby a node
self-configures its own IP address; usually applies to IPv6 nodes)

SLLAO Source Link-Layer Address Option (in ICMPv6, an option carrying the
sender's link layer address)

SMSS Sender's MSS (the MSS for a connection as viewed by the sender) 

SMTP Simple Mail Transfer Protocol (a protocol to carry e-mail in transit 
among mail transfer agents) 

SNA Systems Network Architecture (IBM's network architecture) 

SNAP Subnetwork Access Protocol (IEEE terminology for 802.2 encapsulation; 
rare for TCP/IP networks) 

S-NAPTR Straightforward NAPTR (simplified NAPTR where AUS maps 
directly to result without regular expression substitution) 

SNMP Simple Network Management Protocol (status reporting and
configuration settings for network equipment; usually used with UDP/IP)

SOA Start of Authority (DNS RR indicating meta-data about a zone) 

SOAP (formerly) Simple Object Access Protocol (a web services application 
protocol using XML, which provides RPC-like capabilities; SOAP is no longer 
an acronym) 

SPD Security Policy Database (with IPsec, abstraction of database containing 
security policies applying to how traffic is handled—e.g., discard, bypass, or 
protect) 

SPI Security Parameter Index (in IPsec, a logical index into the SAD to indicate 
security parameters, either 32 or 64 bits) 

SPNAT, CGN, LSN Service-Provider ("large scale") NAT (a NAT deployment 
arrangement where address translation is performed by a service provider 
instead of a customer) 

SRP Secure Remote Password (a strong key agreement protocol based on
passwords; being supported by various security protocols such as TLS and EAP)

SRTP Secure Real-Time Protocol (a secure variant of the UDP/IP based
real-time protocol; typically used to carry multimedia information)

SRTT Smoothed RTT (in TCP, time-averaged estimate of a connection's RTT)

SSDP Simple Service Discovery Protocol (an IETF-specified distributed service
discovery protocol designed for LANs and residential networks used by
UPnP)

SSH Secure Shell Protocol (secure remote login/execution protocol; also
supports tunneling of other protocols)

SSID Service Set Identifier (802.11 network name) 

SSL Secure Sockets Layer (encrypted and integrity-protected layer above TCP;
precursor to TLS)

SSM Single-Source Multicast (multicast wherein only a single party can source
traffic to a particular group)

STA Station (IEEE 802.11 terminology for an access point or associated wireless
host)

STP Spanning Tree Protocol (protocol used among bridges and switches to
avoid loops)

STUN Session Traversal Utilities for NAT (a client/server protocol for helping
to fix the address and port number of a traffic flow when passing through a
NAT)

SWS Silly Window Syndrome (in protocols using window-based flow control,
an undesirable situation where small amounts of data are exchanged due to
the use of small window sizes)

SYN Synchronize (a TCP header bit and first segment type sent on a TCP
connection)

TCP Transmission Control Protocol (a connection-oriented reliable stream
protocol lacking message boundaries, which includes flow and congestion
control)

TCP-AO TCP Authentication Option (in TCP, an algorithm-agile mechanism to
combat MSM attacks)

TDES, 3DES Triple DES (encryption using three rounds of DES encipherment,
resulting in an effective key length of 112 bits)

TDM Time Division Multiplexing (sharing by allocation of separate usage time
slots)

TFC Traffic Flow Confidentiality (in IPsec, methods to disguise the traffic flow
even when encrypted, including padding and generation of dummy packets)

TFRC TCP Friendly Rate Control (methods to control the sending rate of a
protocol so as to not compete unfairly with a TCP flow in a similar operating
environment)

TFTP Trivial File Transfer Protocol (UDP/IP-based simple transfer protocol)

TKIP Temporal Key Integrity Protocol (replaced the WEP encryption algorithm
for WPA)

TLD Top-Level Domain (a top-level domain name such as EDU, COM, UK, ZA)

TLS Transport Layer Security (based on the SSL protocol developed by
Netscape)

TLV Type/Length Value (used in protocols; indicates a type, length of
variable-length value, and the value)

ToS Type of Service (older name for the IPv4 header byte indicating type of
service; replaced with DS Field and ECN bits)

TS Traffic Selector (with IKE, specifications for identifying traffic such as IP
address range, port number, etc.)

TSER, TSecr Timestamp Echo Reply (in TCP, portion of TSOPT used to echo 
TSV value to peer) 

TSF Time Synchronization Function (establishes a common time in an 802.11
BSS)

TSIG Transaction Signatures (signatures used to secure individual DNS
transactions, not content from its origin)

TSOPT Timestamps Option (in TCP, an option including the TSV and TSER 
values) 

TSPEC Traffic Specification (a structure indicating traffic parameters for 802.11 
QoS) 

TSV Timestamp Value (in TCP, portion of TSOPT used to identify the sender's 
time—used in RTTM and PAWS) 

TTL Time-to-Live (IPv4 header field indicating number of remaining router 
hops allowed for a datagram) 

TURN Traversal Using Relay NAT (a protocol in which a third party relays 
information between hosts that are otherwise unable to communicate due to 
the presence of one or more NATs) 

TWA Time-Wait Assassination (in TCP, an erroneous condition caused by 
receiving certain segments during TIME-WAIT state) 

TXOP Transmission Opportunity (in 802.11, a form of "credit" allowing a
station to send one or more frames)

TXT Text (DNS RR carrying descriptive text; used by DKIM) 

UBM Unicast Prefix-based Multicast addressing (deriving multicast addresses 
based on assigned unicast prefixes) 

UDL Unidirectional Link (link providing communication in only one 
direction) 

UDP User Datagram Protocol (a best-effort message protocol with message 
boundaries and lacking congestion or flow control) 

UEQM Unequal Modulation (using different modulation schemes on different 
data streams simultaneously) 

ULA Unique Local IPv6 Unicast Addresses (private addresses used with IPv6, 
allocated from the fc00::/7 prefix) 

U-NAPTR URI-enabled NAPTR (simplified NAPTR allowing limited regular 
expression substitution) 

U-NII Unlicensed National Information Infrastructure (unlicensed radio
spectrum in much of the world)

UNSAF Unilateral Self-Address Fixing (heuristics used in an attempt to
determine how a traffic flow is identified after passing through a NAT; a fragile
process for which techniques like ICE are recommended alternatives)

UP User Priority (802.11 priorities; based on same terminology from 802.1d) 

UPnP Universal Plug and Play (a protocol framework for device and service 
discovery aimed at the residential user; standardized by the UPnP Forum) 

URG Urgent Mechanism (in TCP, a method for marking and identifying
information as "urgent"; not recommended for use)

URI Uniform Resource Identifier (string of characters identifying a name or
resource on the Internet, including URLs and URNs)

URL Uniform Resource Locator (informally, a "WWW address")

URN Uniform Resource Name (a URI using the urn scheme not implying
availability of resource)

USRK Usage-Specific Root Key (key derived from an EMSK intended to be 
used for certain purposes) 

UTC Coordinated Universal Time (standard time used by NTP and other 
protocols; effectively interchangeable with GMT but with some technical 
differences) 

UTO User Timeout (in TCP, the maximum time a TCP sender will wait 
attempting to retransmit before abandoning a connection) 

VC Virtual Circuit (a simulated dedicated communication path) 

VLAN Virtual LAN (used most often to simulate multiple distinct LANs on 
shared wiring) 

VLSM Variable-Length Subnet Masks (proximal use of subnet masks of
differing lengths in same environment)

VoIP Voice over IP (the carriage of voice traffic over IP networks, usually 
involves SIP signaling) 

VPN Virtual Private Network (virtually isolated network; often encrypted) 

W3C World Wide Web Consortium (SDO defining web standards such as 
XML) 

WAN Wide Area Network (a network connecting geographically distributed 
sites; usually involving multiple administrative authorities) 

WEP Wired Equivalent Privacy (original WiFi encryption; found to be
catastrophically weak)

WESP Wrapped ESP (in IPsec, a method to prepend ESP with a header to
indicate if the following traffic is encrypted or only authenticated; useful for
inspection by middleboxes)

Wi-Fi Wireless Fidelity (IEEE 802.11 wireless LAN standard)

WiMAX Worldwide Interoperability for Microwave Access (IEEE 802.16
wireless broadband standard)

WKP Well-Known Prefix (a checksum neutral IPv6 prefix, 64:ff9b::/96, used in
algorithmic mappings between IPv4 and IPv6 addresses)

WLAN Wireless LAN (a wireless LAN such as WiFi)

WMM Wi-Fi Multimedia (subset of 802.11e QoS functions now available in
802.11n)

WoL Wake on LAN (method to remain in "sleep" mode until a particular
packet is received)

WPA WiFi Protected Access (802.11 encryption method)

ogy until widespread use of TCP/IP)

XMPP Extensible Messaging and Presence Protocol (an open, extensible,
XML-based protocol for the exchange of messages, presence, and contact list
information)

ZSK Zone Signing Key (a key used with DNSSEC for signing zone contents,
usually signed by a KSK)

Backoff time, in MAC, 121-122
Backup ports, STP, 104-105 

BACP (Bandwidth Allocation Control Protocol), 139, 
935 

Bandwidth Allocation Protocol (BAP), 139 
Bandwidth (capacity) 
allocating in MP, 139 
buffer bloat and, 781 
connections and, 3 

Bandwidth-delay product. See BDP (bandwidth-delay product)

Bandwidth on demand (BOD), 139, 936 
Bandwidth-scalable TCPs, 773 
Bank teller's algorithm, 138 
BAP (Bandwidth Allocation Protocol), 139 
Baran, Paul, 1 

Basic Encoding Rules (BER), 935 

Basic service set. See BSS (basic service set) 

BCMCS (Broadcast and Multicast Service Controller), 935



Index 




BCP (best current practice) category, RFCs and, 23 
BDP (bandwidth-delay product) 
congestion control and, 730 
high-speed networks and, 770 
HSTCP (Highspeed TCP) and, 772 
BER (Basic Encoding Rules), 935 
BER (bit error rate) 

data frame fragmentation, 117-118 
definition of, 935 

Berkeley Internet Name Domain v. 9 (BIND9), 935 
Berkeley sockets 

half-close support, 598 

incoming connection queue and, 636, 639-640 
popular APIs, 22 

restrictions on foreign endpoints, 635 
state transitions, 618 
TCP ports, 588 

TCP_NODELAY option for disabling Nagle 
algorithm, 700 

Berkeley Software Distribution. See BSD (Berkeley 
Software Distribution) 

Best current practice (BCP) category, RFCs and, 23 

Best-effort delivery, of packets, 7 

Better-than-Nothing Security (BTNS), 852, 936 

BGP (Border Gateway Protocol), 935 

BI (binary increase), 774 

BIC (Binary Increase Congestion Control) 

BIC-TCP, 773-774 
overview of, 772-773 

Bidirectional tunneling, in mobile IP, 216-217 

Big endian byte ordering, 183 

Binary additive increase algorithm, 773-774 

Binary exponential backoff, retransmission and, 650 

Binary increase (BI), 774 

Binary Increase Congestion Control (BIC) 

BIC-TCP, 773-774 
overview of, 772-773 
Binary notation 

expressing IP addresses in, 32-33 
prefixes, 48 
of subnet masks, 39 
Binary phase shift keying (BPSK) 
definition of, 936 

higher throughput (802.11n) support and, 128 
Binary search increase algorithm, BIC-TCP and, 
773-774 

BIND9 (Berkeley Internet Name Domain v. 9), 935 
Binding method, in STUN, 321 
Binding, MNs (mobile nodes), 216-217 
DHCP relay agents, 269-270
definition of, 936 

Black hats, attacks related to Internet architecture, 26 
Black holes, in PMTUD, 613 
Blackhole route messages, ICMPv6, 372 
Block ciphers, 811 

Blocking route messages, ICMPv6, 372 
Bloop attacks, 429 

BOD (bandwidth on demand), 139, 936 
Bombs, ICMP attacks, 428 
Bonding, link aggregation and, 92-93 
Boot Server Discovery Protocol (BSDP), 246, 936 
BOOTP (Internet Bootstrap Protocol) 
compatibility with DHCP, 236-238 
definition of, 936 
DHCP based on, 235 
options, 238-239 
relay agents, 268 
BOOTREQUEST, 239, 242 
Border Gateway Protocol (BGP), 209, 935 
Bot attacks, 26 
Bot herders, 806 
Botnets 

attacks related to Internet architecture, 26 
taking control of computers, 806 
BPDUs (Bridge PDUs) 

building the spanning tree, 107 
definition of, 936 

RSTP (Rapid Spanning Tree Protocol), 110-111 
STP and, 104 
structure of, 105-107 
viewing with Wireshark, 109 
BPSK (binary phase shift keying) 
definition of, 936 

higher throughput (802.11n) support and, 128 
Bridge PDUs. See BPDUs (Bridge PDUs) 

Bridges 

layer 2 relay agents, 270 

overview of, 98-102 

STP. See STP (Spanning Tree Protocol) 

Bridging Broadband (B4), in DS-Lite, 340, 935 
Broadcast addresses 
overview of, 15 
setting/finding, 437-439 
structure of, 42-43 

Broadcast and Multicast Service Controller 

(BCMCS), in cellular networks, 239, 935 





Broadcast domain, link-layer broadcast, 167 
Broadcasting 

introduction to, 435-436 
overview of, 436-437 
sending broadcast datagrams, 439-441 
setting/finding broadcast addresses, 437-439 
Brute-force attacks, 816 
BSD (Berkeley Software Distribution) 
definition of, 936 
standards and, 24 
Tahoe release, 737 

BSDP (Boot Server Discovery Protocol), 246, 936 
BSS (basic service set) 
definition of, 936 
QoS BSS, 122
in IKE, 869-870

CGAs (cryptographically generated addresses) 
definition of, 937 

Handover Key Request/Reply options, 422-423 
neighbor discovery options in IPv6, 414-415 
RSA Signature option, 415-416 
securing IPv6 Neighbor Discovery, 292 
SEND (Secure Neighbor Discovery) and, 403-406 
verification of, 405 
pseudorandom numbers, generators, and function
families, 815-816

RSA (Rivest, Shamir, and Adleman) public key 
cryptography, 812-813 
signcryption, 814-815 

CS (candidate sets), in source address selection, 
223-224 





CS (cipher suites) 
definition of, 938 
overview of, 819-821 
in TLS, 878 

CS (Class Selector), 938 

CSMA/CA (carrier sense, multiple access with collision
avoidance)
DS (delegation signer) resource record in,
897-898 

NSEC (NextSECure) resource record in, 898-901 

operation of, 902 

overview of, 894-896 

resolver operation example, 903-911 
ICMP Parameter Problem and, 379
overview of, 677-679
overview of, 4
key derivation in EAP, 838
ECN-Echo bit, 784
in MLD, 390
MLD and, 459 

QRI (Query Response Interval), in IGMP/MLD, 468, 
954 

QRV (querier robustness variable), in IGMP/MLD, 
466-467, 954
QS (Quick-Start) 
definition of, 954 
IPv6 options for, 199 
QSTAs (QoS stations), 122-123, 954 
Quadrature amplitude modulation (QAM), 128, 954 
Quadrature phase shift keying (QPSK), 128, 954 
Qualifiers, DNS TXT records, 546 
Quality of service. See QoS (quality of service) 
Querier election, multiple multicast routers and, 466 
Querier robustness variable (QRV), 466-467, 954 
Querier Robustness Variable (QVR), 390 
Querier's Query Interval Code. See QQIC (Querier's 
Query Interval Code) 

Querier's Query Interval (QQI), in IGMP/MLD, 954 
Queries, DNS, 526 

Queries/informational messages, ICMP 
Echo Request/Reply messages, 380-383 
Home Agent Discovery Request message, 386 
MIPv6 fast handover messages, 388 
MLD extension messages, 390-394 
MLD query/report/done messages, 388-390
Mobile Prefix Solicitation message, 387-388 
MRD (Multicast Router Discovery), 394-395 
overview of, 380 

router solicitation and advertisement messages 
in ICMPv4, 383-385 

Query Interval (QI), in IGMP and MLD, 468, 954 
Query message 
ICMP, 388-390 
IGMP, 457-459 

Query/response, in DNS protocol, 518 
Query Response Interval (QRI), in IGMP/MLD, 468, 
954 

Query types, resource record categories, 528 
Question (query) and zone section format, 526 
Queueing theory, in congestion control, 583 
Queues 

packets stored in, 4 

TCP server incoming connection queue, 636-640 
Quick acknowledgments, Linux, 733 
Quick-Start (QS) 
definition of, 954 
IPv6 options for, 199 

Quiet time concept, TCP state transitions, 624 
QVR (Querier Robustness Variable), 390 

R 

RA (Router Advertisement) 

Advertisement Interval option in ND, 412 
definition of, 954 
DHCPv6, 260-263 

Home Agent Information option in ND, 412-413 
ICMP attacks and, 429 
ICMP messages, 383-385 
ICMPv6 messages, 280-281 
link with NA (Neighbor Advertisement), 396 
MTU option in ND, 411-412 
neighbor discovery in IPv6, 396-398 
Prefix Information option, 410-411 
Route Information option in ND, 420 
Router Advertisement Flags Extension option, 
420-421 

Trust Anchor option, 417 
RADIUS (Remote Authentication Dial-In User 
Service) 

for access control, 141 
definition of, 955 

RAIO (Relay Agent Information Option)
definition of, 955 
DHCP, 268 

Random Early Detection (RED) gateways 

AQM (active queue management) and, 783-785 
definition of, 955 


Random numbers, in ND, 416-417 
Rapid Commit option, DHCP/BOOTP message 
format, 273-274 

Rapid Spanning Tree Protocol (RSTP), 103, 110-111,
956 

RARP (reverse ARP), 166, 955 
RASs (remote access servers) 
control decisions by, 141 
definition of, 955 
Rate-based flow control, 583 
Rate halving, for TCP congestion control, 741-742 
Rate-Halving with Bounding Parameters (RHBP) 
definition of, 955 

for TCP congestion control, 741-742 
Rate limiting, of ICMP messages in Linux server, 
369-370 
RC4 algorithm 
definition of, 955 
in Wi-Fi security, 129-130 
RD (Router Discovery) 
definition of, 955 
overview of, 383-385 
RDATA, in DNS resource record, 527, 955 
RDNSS (Recursive DNS Server) 
definition of, 955 
neighbor discovery in IPv6, 420 
Real-Time Protocol (RTP), 313 
Reassembly 

fragmentation and, 488 
of fragmented datagrams, 14, 205
timeout, 492 

Rebinding time (T2), for DHCP messages, 240 
Receive window structure, sliding window protocol, 
701 

Reconfigure extension, DHCP, 273 
Record layer, in TLS 

DTLS (Datagram TLS), 891-892 
TLS (Transport Layer Security), 877 
Record markers, datagrams and, 5-6 
Record protocol, TLS, 878-880 
Recovery point, in TCP retransmission, 671 
Recur field, GRE tunnels and, 150 
Recursive DNS Server (RDNSS) 
definition of, 955 
neighbor discovery in IPv6, 420 
RED (Random Early Detection) gateways 

AQM (active queue management) and, 783-785 
definition of, 955 
Redirect messages, ICMP 
ICMP attacks and, 428 
overview of, 372-375 
Redirected Header option in ND, 411 
Reference model, of protocol suite, 1 
Referenced connections, TCP reset segments and,
625 

regedit program in Windows, setting keepalive 
time with, 797 

Regional Internet registries (RIRs) 

allocation of IP addresses and, 62-63 
definition of, 955 
Registered ports, 18 
Reject route message, ICMPv6, 372
Relative fairness, congestion control schemes and, 
769 

Relay Agent Information Option (RAIO) 
definition of, 955 
DHCP, 268 
Relay agents, DHCP 

layer 2 devices, 270-271 

leasequery and bulk leasequery, 269-270 

overview of, 267 

RAIO (Relay Agent Information Option), 268 
Remote-ID and IPv6 Remote-ID, 268 
Server Identifier Override, 268-269 
Relayed transport address, TURN, 326 
Reliability, TCP, 586-587 
Remote access servers (RASs) 
control decisions by, 141 
definition of, 955 

Remote Authentication Dial-In User Service 
(RADIUS) 

for access control, 141 
definition of, 955 

Remote-ID, DHCP relay agents, 268
Remote procedure call (RPC) 
definition of, 955 
SOAP and, 338 
Rendezvous point (RP) 
definition of, 955 
for multicast groups, 60 

Renegotiation, of cryptographic connection parameters
in TLS, 884

Renewal time (T1), for DHCP messages, 240
Reno algorithm, 737-738 

Renumbering, allocation of IP addresses and, 63 
Repacketization, in TCP 
overview of, 586 

TCP timeout/retransmission, 686-687 
Replay attacks 

Nonce option in ND countering, 416-417 
types of threats to network communication, 808 
Report message, ICMP, 388-390 
REQUEST message, DHCPv6, 264-265, 269 
Request/response transactions, in STUN, 320 
Request to send. See RTS (request to send) 

Reserved addresses, IPv6 multicast, 61 


Reserved field, in IPv6 Fragment header, 203-204 
Reset (RST) segments, TCP 

aborting connections, 627-628 
definition of, 956 
half-open connections, 628-630 
incoming connection queue and, 637 
overview of, 625-626 

requesting connection to nonexistent port, 626 
TWA (TIME-WAIT Assassination), 630-631 
Resolver 

accessing DNS with, 511 

DNSSEC example of operation, 903-911 

UDP and, 525-526 

validating security aware resolver, 895 
Resource Record Set. See RRSet (Resource Record Set) 
Resource Record Signature resource record. See 

RRSIG (Resource Record Signature) resource 
record 

Resource records. See RRs (resource records) 
Resource utilization attacks 

attacks related to IGMP or MLD, 469-470
UDP-related attacks, 506 
Response algorithm, for spurious timeouts and 
retransmissions, 677 
Restart Window (RW), 739, 956 
Retransmission 
ARQ and, 580 
of packets, 7 

in TCP. See TCP timeout/retransmission 
timeout settings in communication protocols, 584 
Retransmission ambiguity problem, 655, 679 
Retransmission (RTX), 894, 956 
Retransmission timeout. See RTO (retransmission 
timeout) 

Retry bit, Frame Control Word, 116
Return Routability Procedure (RRP), in MIP 
definition of, 956 
in RO, 218-219 

Reverse ARP (RARP), 166, 955
Reverse DNS queries, 536 

Reverse lookup, host names from IP addresses, 19 
Reverse Path Forwarding (RPF), 956 
RFC (Request for Comments), 23-24, 955 
RGMP (Router-port Group Management Protocol), 
469,955 

RH (Routing Header) 
definition of, 955 
in IPv6, 200-203 

RHBP (Rate-Halving with Bounding Parameters) 
definition of, 955 

for TCP congestion control, 741-742 
RIID field, IPv6 multicast addresses, 60-61 
Rijmen, Vincent, 811 
Rijndael algorithm. See AES (Advanced Encryption
Standard) 

RIP (Router Information Protocol), 955 
RIRs (regional Internet registries) 

allocation of IP addresses and, 62-63 
definition of, 955 

Rivest, Shamir, and Adleman. See RSA (Rivest, 
Shamir, and Adleman) 
rlogin (UNIX), precursor to SSH, 692 
RO (route optimization), in MIP 
definition of, 955 
in mobile IP, 217-219 

ROAD (Running Out of Address Space), 955 
Robust Header Compression (ROHC), 143, 955
Robust Security Network access (RSNA)
definition of, 956
in Wi-Fi security, 129
Robust Security Network (RSN)
definition of, 956
in Wi-Fi security, 129

Robustness/reliability, of IGMP and MLD, 465-467
ROHC (Robust Header Compression), 143, 955 
Roles, STP, 104-105 

Root bridge, building the spanning tree in STP, 107 
Root certificates, 822 
Root ports, STP, 104-105 
Rogue RAs, ICMP attacks and, 429
Round-robin, DNS, 565-567 
Round-trip time. See RTT (round-trip time) 
Round-trip-time 
estimation, 584 
traceroute measuring, 377 
Route aggregation, 48-50 
Route Information option, in ND, 420 
Route optimization (RO), in MIP 
definition of, 955 
in mobile IP, 217-219 
Route Type identifiers, 201 
Router Advertisement. See RA (Router 
Advertisement) 

Router Alert, IPv6 options for, 198 
Router Discovery (RD) 
definition of, 955 
overview of, 383-385 
Routing Header (RH) 
definition of, 955 
IPv6, 200-203 

Router Information Protocol (RIP), 955 
Router Requirements RFC, 23
Router Solicitation. See RS (Router Solicitation) 
Router solicitation and advertisement messages in 
ICMPv4, ICMP, 383-385 


Routers 

congestion of, 727-728 
crashes, 226 
default router, 208 

IGMP/MLD processing by multicast routers, 
457-459 
IP routers, 20 

multicast routing, 452-454 
between packet-switching networks, 1 
in small networks, 11-13 

Routing Policy Specification Language (RPSL), 65, 
956 

Routing protocols, 209 
Routing tables, 208, 439-441 
RP (rendezvous point), in IP Multicast 
definition of, 955 
for multicast groups, 60 
RPC (remote procedure call) 
definition of, 955 
SOAP and, 338 

RPF (Reverse Path Forwarding), 956
RPSL (Routing Policy Specification Language), 65, 
956 

RRP (Return Routability Procedure), in MIP 
definition of, 956 
in RO, 218-219 

RRs (resource records), in DNS 

address and name server records, 529-530 

CNAME (canonical name) records, 534-536 

definition of, 956 

in DNS message format, 520-521 

dynamic DNS updates and, 555-557 

ENUM records, 551-552 

example using resource record types, 530-534 

MX (mail exchanger) records, 544-545 

NAPTR (name authority pointer) records, 549-551 

OPT (option) pseudo records, 547-548 

overview of, 527-529 

PTR (pointer) records, 536-541

S-NAPTR and U-NAPTR, 554-555 

SIP records, 552 

SOA (start of authority) records, 541-544 
SPF (sender policy framework) and TXT records, 
545-547 

SRV (service) records, 548-549 
translating DNS from IPv4 to IPv6, 569 
transparency and, 568 
URI/URN resolution, 553-554 
RRs (resource records), DNSSEC 
DNSKEY resource record, 896-897 
DS (delegation signer) resource record, 897-898 
NSEC (NextSECure) resource record, 898-901 
overview of, 896
RRSIG (Resource Record Signature) resource
record, 901-902
RRset (Resource Record Set) 
canonical ordering of, 902-903 
definition of, 956 

dynamic DNS updates and, 555-557 
overview of, 527 

RRSIG (Resource Record Signature) resource record 
definition of, 956 
DNSSEC, 901-902 
signed zones and zone cuts, 903 
RS (Router Solicitation) 
definition of, 956 
DHCPv6, 260-263 
ICMP attacks and, 429 
ICMP messages, 383-385 
ICMPv6 messages, 280 
link with NS (Neighbor Solicitation), 396 
neighbor discovery in IPv6, 396-398 
RSA (Rivest, Shamir, and Adleman) 
in cipher suites, 821 
definition of, 956 
ECC as alternative to, 815 
overview of, 812-813 
TLS extensions, 883 
RSA Signature option, in ND, 415-416 
RSN (Robust Security Network) 
definition of, 956 
in Wi-Fi security, 129 
RSNA (Robust Security Network access) 
definition of, 956 
in Wi-Fi security, 129 
RST. See Reset (RST) segments, TCP 
RSTP (Rapid Spanning Tree Protocol), 103, 110-111, 
956 

RTO (retransmission timeout), in TCP 
classic method, 651-652 
clock granularity and RTO bounds, 654 
definition of, 956 
initial values, 654 
introduction to, 647 
Linux RTT estimation, 657-661 
retransmission ambiguity and Karn's algorithm, 
655 

robustness of RTTM to loss and reordering, 
662-664 

RTT estimation behaviors, 661-662 
RTTM (RTT Measurement) with Timestamps 
option, 656-657 
setting, 651 

slow start algorithm and, 732 


spurious. See Spurious timeouts and retransmissions, 
in TCP 

standard method, 652-654 
TCP connections and, 611 
RTP (Real-Time Protocol), 313 
RTS (request to send) 
carrier sense and, 121 
definition of, 956 
Wi-Fi control frames, 115 
RtSolPr (Proxy Router Solicitation), 388 
RTT (round-trip time) 

classic method of RTT estimation, 651-652 
clock granularity and RTO bounds, 654 
definition of, 956 
estimation behaviors, 661-662 
HSTCP (Highspeed TCP) and, 773 
initial values in RTO, 654 
Linux estimation of, 657-661 
Nagle algorithm and, 696-697 
retransmission timeout settings and, 584-585 
RTO based on, 648, 651 
standard method of estimating, 652-654 
"stop and wait" protocol and, 581 
STUN messages, 320 
TCP Timestamp option and, 610 
RTTM (RTT Measurement) 

robustness to loss and reordering, 662-664 
with Timestamps option, 656-657, 956 
RTTVAR (RTT Variance), in TCP, 685-686, 956 
RTX (Retransmission), 894, 956 
RW (Restart Window), in TCP, 739, 956 

S 

S-NAPTR (straightforward NAPTR) 
definition of, 958 
DNS resource record types, 554 
SACK (selective acknowledgement) 
definition of, 957 

DSACK (duplicate SACK) extension, 677-679 
example of retransmission with, 673-676 
fast retransmit and SACK recovery, 757-759 
receiver behavior, 672 
retransmission with, 647, 671-672 
sender behavior, 673 
for TCP congestion control, 740-741 
in TCP header, 589, 607 
SAD (security association database) 
definition of, 957 
in IPSec, 841-842 

SAE (Simultaneous Authentication of Equals) 
definition of, 957 
Wi-Fi mesh and, 130 
Salt, in cryptography, 816 
SAP (Session Announcement Protocol) 
definition of, 957 
for multicast sessions, 55 
SAs (Security Associations), in IPsec 

CREATE_CHILD_SA exchange, 852-853 
definition of, 956 

GSAs (group security associations), 864 
in IPSec, 841 

payloads and proposals, 847-848 
proposed algorithms, 867-869 
Scalability, of DNS, 516 
Scope 

ICMPv6 error (Beyond Scope of Source Address), 
371 

of IPv6 addresses, 43 
IPv6 multicast addresses, 57-58 
of multicast addresses, 53 
SCSV (Signaling Cipher Suite Value), 884, 957 
SCTP (Stream Control Transmission Protocol) 
definition of, 957 
NAT and, 309 

transport protocols in TCP/IP suite, 16 
SCVP (Server-Based Certificate Validation Protocol) 
certificate validation and, 831 
definition of, 957 

SDID (Signing Domain Identifier), 916, 957 
SDLC (Synchronous Data Link Control) 
based on HDLC, 131 
definition of, 957 

SDOs (standards-defining organizations), 23, 957 
SDP (Session Description Protocol) 
definition of, 957 
ICE and, 332-333 
IP multicast and, 55 
Secondary DNS servers, 517 
Secret Key Transaction Authentication for DNS 
(TSIG), 911-914 

Secs field, DHCP/BOOTP message format, 236 
Secure Entry Point (SEP) bit, DNSSEC, 896, 905, 957 
Secure Hash Algorithm 1. See SHA 1 (Secure Hash 
Algorithm 1) 

Secure hash function, 404 
Secure Neighbor Discovery. See SEND (Secure 
Neighbor Discovery) 

Secure Real-Time Protocol (SRTP), 883, 958 
Secure Remote Password (SRP), 883, 958 
Secure Shell. See SSH (Secure Shell) 

Secure Sockets Layer (SSL). See also TLS (Transport 
Layer Security), 876-877, 958 
Security 

ACs (attribute certificates), 831 
basic principles, 806-807 


certificates, CAs, and PKIs, 821-822 
cryptographic and cipher suites, 819-821 
cryptographic nonces and salt, 816 
cryptosystems, 809-812 
DH (Diffie-Hellman-Merkle Key Agreement), 
813-814 

ECC (Elliptic Curve Cryptography), 815 
hash functions and message digests, 817-818 
introduction to, 805-806 
message authentication codes, 818-819 
PFS (Perfect Forward Secrecy), 815 
protocols. See Security protocols 
pseudorandom numbers, generators, and function 
families, 815-816 

public key certificates, CAs, and X.509, 822-828 
RSA (Rivest, Shamir, and Adleman) public key 
cryptography, 812-813 
signcryption, 814-815 
summary and references, 919-932 
threats to network communication, 807-809 
validating and revoking certificates, 828-831 
Wi-Fi, 129-130 

Security association database (SAD) 
definition of, 957 
in IPSec, 841-842 

Security Associations. See SAs (Security 
Associations) 

Security Gateway (SG), in IPsec, 840, 957 

Security Parameter Index (SPI), in IPsec 
definition of, 958 
IKE protocol, 844 

Security policy database (SPD), in IPsec 
definition of, 958 
in IPSec, 841-842 

Security protocols 
attacks on, 918-919 

DKIM (Domain Keys Identified Mail), 915-918 
DNS. See DNSSEC (DNS Security) 

EAP methods, 837-838 
ERP (EAP Re-authentication Protocol), 839 
Internet Key Exchange. See IKE (Internet Key 
Exchange) 

IPSec (IP Security), 840-842 
IPSec NAT traversal, 865-867 
L2TP/IPSec, 865 
layering and, 832-833 
NAC (Network Access Control), 833-837 
PANA (Protocol for Carrying Authentication for 
Network Access), 839-840 
transport layer. See TLS (Transport Layer 
Security) 

Segments Left field, in Routing header, 201-202 
Segments, TCP, 586 

Selective acknowledgement. See SACK (selective 
acknowledgement) 

Selective retransmission, 673 
Self-clocking 
ACKs and, 731 
Nagle algorithm and, 696 
Self-describing padding, 137 

Send maximum segment size. See SMSS (send 
maximum segment size) 

SEND (Secure Neighbor Discovery) 

Certificate option, 417 

certification path solicitation/advertisement, 
406-407 

CGAs (cryptographically generated addresses), 
403-406 

definition of, 957 

Handover Key Request/Reply options, 422-423 

ICMP attacks and, 429 

neighbor discovery options in IPv6, 414-415 

Nonce option in ND, 416-417 

overview of, 403 

RSA Signature option, 415-416 

securing IPv6 Neighbor Discovery, 292 

Timestamp option, 416 

Trust Anchor option, 417 

as variant on ND, 396 

Send window structure, sliding window protocol, 
701 

Sender pause and local congestion (event 1), TCP 
congestion control, 750-754 
Sender policy framework (SPF) records, DNS 
resource record types, 545-547 
SEP (Secure Entry Point) bit, DNSSEC, 896, 905, 957 
Sequence Control field, data frame fragmentation, 
117 

Sequence numbers 

for avoiding duplicate packets, 580 

data frame fragmentation and, 117 

GRE, 150 

PPP, 138 

TCP, 587-588 

TCP-related attacks, 641 

TCP segments, 701 

URG, 590 

Sequencing header, in MP, 138 
Sequential Port-Symmetric NAT (SP), 486 
Server alive messages, TCP keepalive attacks, 802 
Server-Based Certificate Validation Protocol (SCVP), 
831 

Server Identifier Override, DHCP relay agents, 
268-269 


Server Load Reduction (SLR), 486 
ServerHello message, in TLS, 887-889 
Servers 

accessing servers behind NAT, 314 
iterative and concurrent, 21 
server host crashes and does not reboot (keepalive 
scenarios), 796 

server host crashes and reboots (keepalive 
scenarios), 797-799 

server host unreachable (keepalive scenarios), 
799-800 

Service model, TCP, 585-586 

Service provider NAT. See SPNAT (service provider 
NAT) 

Service set identifiers (SSID) 
definition of, 958 
Wi-Fi, 112 

Service sets, Wi-Fi, 112 

Service (SRV) records, DNS resource record types, 
548-549 

Session Announcement Protocol (SAP) 
definition of, 957 
for multicast sessions, 55 
Session Description Protocol. See SDP (Session 
Description Protocol) 

Session Initiation Protocol (SIP) 
definition of, 957 
ENUM records and, 551-552 
Session keys, in public key cryptography, 812 
Session layer, of OSI model, 10 
Session timers, NAT, 307-308 
Session Traversal Utilities for NAT. See STUN 
(Session Traversal Utilities for NAT) 

SFD (start frame delimiter), in link layer protocols 
clock recovery in Ethernet frames, 84 
definition of, 957 

SG (Security Gateway), in IPsec, 840, 957 
SHA 1 (Secure Hash Algorithm 1) 
for authentication in DHCP, 268 
definition of, 957 
overview of, 817-818 
TLS extensions, 883 
Shannon, Claude, 579 
Sharing connection state, 767-768 
Shim6 protocol, 70 

Short Interframe Space (SIFS), in Wi-Fi 
definition of, 957 
in MAC, 122 

Short sequence number, LCP options, 138 
Short-term credential mechanism, STUN, 325 
Siaddr (Next Server IP Address) field, DHCP/ 
BOOTP message format, 238, 246 
SIFS (Short Interframe Space), in Wi-Fi 
definition of, 957 
in MAC, 122 

Signaling Cipher Suite Value (SCSV), 884, 957 
Signature verification, CGAs for, 405-406 
Signed zones, DNSSEC, 903 
Signing Domain Identifier (SDID), 916, 957 
SIIT (Stateless IP/ICMP Translation) 
definition of, 957 
IPv4/IPv6 translation, 342-344 
Silly windows syndrome. See SWS (silly windows 
syndrome) 

Simple Mail Transfer Protocol. See SMTP (Simple 
Mail Transfer Protocol) 

Simple Network Management Protocol (SNMP) 
definition of, 958 
well-known port for, 18 
Simple Object Access Protocol (SOAP) 
definition of, 958 
GENA using, 338 

Simple Service Discovery Protocol. See SSDP (Simple 
Service Discovery Protocol) 

Simple Tunneling of UDP through NATs, 319 
Simultaneous Authentication of Equals (SAE) 
definition of, 957 
Wi-Fi mesh and, 130 
Simultaneous close, in TCP connections 
overview of, 600-601 
state transition, 625 

Simultaneous open, in TCP connections 
defined, 597 
overview of, 599-600 
state transition, 625 
SIP Outbound mechanism, in ICE, 333 
SIP records, DNS resource record types, 552 
SIP (Session Initiation Protocol) 
definition of, 957 
ENUM records and, 551-552 
SLAAC (stateless address autoconfiguration) 
configuring IPv4 link-local addresses, 276 
configuring IPv6 link-local addresses, 276-277 
deciding whether to use, 244 
definition of, 957 
example of, 278-283 

IPv6 DAD (Duplicate Address Detection), 277-278 
IPv6 global addresses, 278 
overview of, 276 
stateless DHCP and, 283-284 
utility/benefit of, 284-285 
Sliding window protocol 

movement of windows, 702-704 
in packet communication, 582 
send and receive structures, 701 


TCP as, 589 

SLLAO (Source Link-Layer Address Option), 
409-410, 958 
Slot time, in MAC, 122 
Slow start algorithm, in TCP 

classic algorithms for TCP congestion, 732-734 
comparing with congestion avoidance, 736-737 
limited, 772 

viewing slow start behavior with Wireshark, 
749-750 

Slow start threshold. See ssthresh (slow start 
threshold) 

SLR (Server Load Reduction), 486 
Smack attacks, ICMP attacks and, 429 
Smoothed RTT. See SRTT (smoothed RTT) 

SMSS (send maximum segment size) 
definition of, 958 

SWS (silly windows syndrome) and, 709 
TCP connections and, 613 
SMTP (Simple Mail Transfer Protocol) 
definition of, 958 

MX (mail exchanger) records and, 544 
SRV record providing SMTP service, 549 
well-known port for, 18 
Smurf attacks, ICMP, 428 
SNA (System Network Architecture) 
definition of, 958 
SDLC in, 131 

Sname (Server Name) field, DHCP/BOOTP message 
format, 238-239 

SNAP (Subnetwork Access Protocol), 958 
Sniffing, 808 

SNMP (Simple Network Management Protocol) 
definition of, 958 
well-known port for, 18 
Snooping 
DHCP, 276 

IGMP/MLD, 468-469 
SNS (Symmetric NAT Support), 486 
SOA (start of authority) records 
definition of, 958 

DNS resource record types, 541-544 
SOAP (Simple Object Access Protocol) 
definition of, 958 
GENA using, 338 
sock program 

creating UDP datagram, 493, 496 
generating UDP datagram with, 478-481 
restricting local IP addresses, 634 
Sockets 

popular APIs, 22 

in TCP connections, 595-596 

TCP ports, 588 
SOCKS proxy firewalls, 302-303 
Soft state 

ARP cache timeout and, 174 
multicast information and, 441 
SOLICIT message, DHCPv6, 260, 269 
Solicitation messages, in MRD, 394-395 
Solicitation of servers, by clients, 435 
Sort lists, DNS, 565-567 

Source address selection algorithm, in IP host 
models, 223-224 
Source IP addresses 

address selection by hosts, 222-223 
host processing of IP datagrams, 220-221 
ICMPv6 errors, 371-372 
in IP datagrams, 186 

Source Link-Layer Address Option (SLLAO), 
409-410, 958 

Source Quench messages, TCP congestion control 
attacks, 785 

Source/Target Address List options, in ND, 413-414 

SP (Sequential Port-Symmetric NAT), 486 

Spam 

DNS resource record for fighting, 545-547 
as malware, 806 
Spanning tree, building, 107 
Spanning Tree Protocol. See STP (Spanning Tree 
Protocol) 

Spatial multiplexing, power save mode, 120 
Spatial streams, higher throughput (802.11n) 
support, 126 

SPD (security policy database) 
definition of, 958 
in IPSec, 841-842 
Spear phishing attacks, 806 
Special-use IP addresses 
for IPv4, 50-51 
for IPv6, 51-52 

local net (limited) broadcast, 43 
SPF (sender policy framework) records, DNS 
resource record types, 545-547 
SPI (Security Parameter Index) 
definition of, 958 
IKE protocol, 844 
Split DNS, 565-567 
SPNAT (service provider NAT) 
definition of, 958 
DS-Lite and, 339 
overview of, 315-316 
Spoofing attacks 
ICMP, 429 

Internet architecture, 25 
IP addresses, 70, 226 
TCP, 640-642 

TCP keepalive attacks, 802 


Spurious association attacks, 808 
Spurious timeouts and retransmissions, in TCP 
congestion control and, 744-745 
DSACK (duplicate SACK) extension, 677-679 
Eifel Detection Algorithm, 679-680 
Eifel Response Algorithm, 680-682 
F-RTO (Forward-RTO Recovery), 680 
overview of, 677 

SRC (Source) address, in Ethernet frame format, 85 
SRP (Secure Remote Password), 883, 958 
SRTP (Secure Real-Time Protocol), 883, 958 
SRTT (smoothed RTT) 

classic method of RTT estimation, 651-652 
definition of, 958 
destination metrics and, 685-686 
SRV (service) records, DNS resource record types, 
548-549 

SSDP (Simple Service Discovery Protocol) 
definition of, 958 

direct interaction with NAT and firewalls, 338 
viewing in-use multicast groups in Windows 
OSs, 448 
SSH (Secure Shell) 

for application-managed keepalives, 794 
definition of, 958 
TCP data flow and, 692 
tracing RTT of TCP connection, 697-698 
well-known port for, 18 
well-known ports for, 632 
SSID (service set identifiers) 
definition of, 958 
Wi-Fi, 112 

SSL (Secure Sockets Layer). See also TLS (Transport 
Layer Security), 876-877, 958 
SSM (source-specific multicast) 

attacks related to IGMP or MLD, 470 
definition of, 959 
IGMP and MLD supporting, 452 
MLD supporting, 390 
as multicast service model, 54 
ssthresh (slow start threshold), in TCP congestion 
control 

comparing slow start with congestion avoidance, 
736 

Eifel Response Algorithm and, 744-745 
overview of, 733 

standard TCP algorithm and, 738 
Standard RTO method, in TCP, 652-654 
Standard TCP congestion control algorithm, 728-739 
Standards 

IETF (Internet Engineering Task Force) in, 22-23 

link layer, 80-82 

other organizations in, 23-24 

RFC (Request for Comments) and, 23-24 
Standards-defining organizations (SDOs), 23, 957 
Standards-track category, RFCs and, 23 
Start frame delimiter (SFD) 

clock recovery in Ethernet frames, 84 
definition of, 957 

Start of authority (SOA) records, in DNS 
definition of, 958 

DNS resource record types, 541-544 
STAs (stations), in Wi-Fi 
definition of, 959 
Wi-Fi, 112 

State-change records, IGMP/MLD group membership 
reports, 457 
State machine, DHCP, 251-252 
State, storing in connection switches, 5 
State transitions, TCP 
FIN_WAIT_2 state, 625 
overview of, 616 
quiet time concept, 624 
simultaneous open and close transitions, 625 
state transition diagrams, 617-618 
TIME_WAIT state (2MSL), 618-624 
Stateful translation, IPv4/IPv6, 344-345 
Stateless address autoconfiguration. See SLAAC 
(stateless address autoconfiguration) 
Stateless IP/ICMP Translation (SIIT) 
definition of, 957 
IPv4/IPv6 translation, 342-344 
Stateless mode, DHCPv6, 283-284 
Static multiplexing, 4 

Station-to-Station (STS) protocol, relation to DH 
(Diffie-Hellman), 814 
Statistical multiplexing, 4 
STODER, repacketization and, 686 
"Stop and wait" protocol 

communication protocols and, 581 
Nagle algorithm and, 697 
TCP and, 696 

STP (Spanning Tree Protocol), in bridges 
BPDU structure, 105-107 
building the spanning tree, 107 
definition of, 959 
example of, 107-109 
handling topology changes, 107 
overview of, 102-104 
port states and roles, 104-105 
RSTP (Rapid Spanning Tree Protocol), 110-111 
Straightforward NAPTR (S-NAPTR) 
definition of, 958 
DNS resource record types, 554 
Stream ciphers, symmetric key ciphers, 811 
Stream Control Transmission Protocol. See SCTP 
(Stream Control Transmission Protocol) 


Stretch ACKs, 754-757 
Strong host model, 220 

STS (Station-to-Station) protocol, relation to DH 
(Diffie-Hellman), 814 

STUN (Session Traversal Utilities for NAT) 
attributes defined by TURN, 328 
binding method, 321 
definition of, 959 
ICE making use of, 332-334 
mechanisms, 325-326 
message formats, 320 
Teredo servers compared with, 482 
Subdomains, in DNS hierarchy, 514 
Subnet addressing, 36-39 
Subnet broadcast addresses. See Broadcast 
addresses 

Subnet fields, in IP addresses, 37 
Subnet masks 

overview of, 39-41 

VLSM (variable-length subnet masks), 41-42 
Subnetwork Access Protocol (SNAP), 105, 958 
Subnetworks, 37 
Switches and bridges 
attacks on, 155 
layer 2 relay agents and, 270 
link layer and, 98-102 
in small networks, 11-13 
VLAN, 90 

SWS (silly windows syndrome) 
definition of, 959 
example of avoiding, 709-715 
overview of, 708 
rules for avoiding, 708-709 
Symmetric key encryption 

cryptographic algorithms, 809-811 
KDF (key derivation function) in, 815 
Symmetric NAT Support (SNS), 486 
SYN bit field, TCP header, 589-590 
SYN cookies, in TCP 

attacks related to window management and, 723 
TCP-related attacks, 640-641 
SYN floods, TCP-related attacks, 640 
SYN segments, in TCP 

combined with ACKs (SACK), 607 
definition of, 959 

establishing TCP connections and, 602-603 
MSL (maximum segment life), 610 
NAT and TCP, 307-308 

requesting connection to nonexistent TCP port, 
626 

in TCP connections, 596-597 
TCP header and, 589 

WSCALE (Window Scale) option and, 608 
Synchronous Data Link Control (SDLC) 
based on HDLC, 131 
definition of, 957 
SYN_RCVD state, in TCP 

incoming connection queue and, 636 
simultaneous open and close transitions, 625 
TCP state transitions, 618 
SYN_SENT state, in TCP 

simultaneous open and close transitions, 625 
TCP state transitions, 618 
System configuration options 

attacks related to system configuration, 292 
autoconfiguration. See SLAAC (stateless address 
autoconfiguration) 

DHCP (Dynamic Host Configuration Protocol). 
See DHCP (Dynamic Host Configuration 
Protocol) 

introduction to, 233-234 
summary and references, 292-298 
System Network Architecture (SNA) from IBM 
definition of, 958 
SDLC in, 131 

T 

T1 (Renewal time), for DHCP messages, 240 
Tahoe algorithm, TCP congestion control, 737-738 
Tarpits, attacks related to window management, 723 
Taylor, Bob, 2 

tc program, for packet scheduling and traffic 
control subsystem in Linux, 752 
TC (Topology Change), in BPDU structure, 106 
TCA (Topology Change Acknowledgment), 106 
TCN (topology change notification), 107 
TCP-AO (Authentication Option) 
definition of, 959 
TCP header, 612 
TCP congestion control 

active queue management and ECN, 782-785 
attacks related to, 785-786 
BIC (Binary Increase Congestion Control), 
772-774 

buffer bloat, 781-782 
classic algorithms for, 730-732 
comparing slow start with congestion avoidance, 
736-737 

congestion avoidance algorithm, 734-736 
connection completion and, 766-767 
CTCP (Compound TCP) algorithm, 779-781 
CUBIC, 775-776 

CWV (Congestion Window Validation), 742-744 

delay-based, 777 

example of handling, 745-749 


FACK (forward acknowledgment) and rate halving 
for, 741-742 

fast retransmit and local congestion, 759-762 
fast retransmit and SACK recovery, 757-759 
FAST TCP algorithm, 778-779 
handling spurious RTOs, 744-745 
in high-speed environments, 770 
HSTCP (Highspeed TCP), 770-772 
introduction to, 727-728 
limited transmit approach to, 742 
NewReno algorithm for, 739-740 
SACK (selective acknowledgement) for, 740-741 
sender pause and local congestion (event 1), 
750-754 

sharing connection state, 767-768 
slow start algorithm, 732-734 
slow start behavior, 749-750 
slowing down TCP senders, 729-730 
standard TCP algorithm, 728-739 
stretch ACKs and recovery from local congestion, 
754-757 

summary and references, 786-792 
Tahoe, Reno, and Fast Recovery algorithms, 
737-738 

TCPW (TCP Westwood) algorithm, 779 
TFRC (TCP Friendly Rate Control), 768-770 
timeouts, retransmissions, and undoing cwnd 
changes, 762-766 
Vegas TCP algorithm, 777-778 
TCP data flow 

attacks related to window management, 723 
delayed ACK interaction with Nagle algorithm, 
699 

delayed ACKs, 695-696 
disabling Nagle algorithm, 699-700 
example of dynamic window size adjustment 
and flow control, 705-708 
example using urgent mechanism, 720-722 
flow control, 700-701 
interactive communication, 692-695 
introduction to, 691 
large buffers and auto-tuning, 715-719 
Nagle algorithm, 696-698 
sliding window protocol, 701-704 
summary and references, 723-725 
SWS (silly windows syndrome), 708-715 
urgent mechanism, 719-720 
zero windows and TCP persistent timer, 704-705 
TCP Friendly Rate Control (TFRC), 768-770, 959 
TCP/IP suite 
implementations and distributions, 24-25 
multiplexing, demultiplexing, and encapsulation, 
16-17 

names, addresses, and DNS, 19 
OSI model compared with, 8-9 
overview of, 13 
port numbers, 17-19 
TCP keepalive 

attacks related to, 802 
description of, 795-797 
introduction to, 793-794 

server host crashes and does not reboot, 797-799 
server host crashes and reboots, 799-800 
server host unreachable, 800-802 
summary and references, 802-803 
TCP segments, 15 
TCP servers 

incoming connection queue, 636-640 
overview of, 631-632 
port numbers and, 632-634 
restrictions on foreign endpoints, 635-636 
restrictions on local IP addresses, 634-635 
TCP timeout/retransmission 
attacks related to, 687 
classic RTO method, 651-652 
clock granularity and RTO bounds, 654 
congestion control and, 762-766 
connection establishment and, 604-605 
destination metrics, 685-686 
DSACK (duplicate SACK) extension, 677-679 
Eifel Detection Algorithm, 679-680 
Eifel Response Algorithm, 680-682 
example of, 648-651 
example of fast retransmit, 668-671 
example of retransmission with SACK, 673-676 
example of timer-based retransmission, 665-667 
F-RTO (Forward-RTO Recovery), 680 
fast retransmit, 667-668 
introduction to, 647-648 
Linux RTT estimation, 657-661 
packet duplication, 684-685 
packet reordering, 682-684 
repacketization, 686-687 

retransmission ambiguity and Karn's algorithm, 
655 

retransmission with SACK, 671-672 
robustness of RTTM, 662-664 
RTO (retransmission timeout) setting, 651 
RTT estimation behaviors, 661-662 
RTTM (RTT Measurement) with Timestamps 
option, 656-657 


SACK receiver behavior, 672 
SACK sender behavior, 673 
spurious timeouts and retransmissions, 677 
standard RTO method, 652-654 
summary and references, 688-690 
timer-based retransmission, 664-665 
TCP (Transmission Control Protocol) 

ARQ as basis of, 579 

connection management. See Connections, TCP 
definition of, 959 

encapsulation in IP datagrams, 587 
flow control and, 7-8 
header fields, 588-590 
introduction to, 584-585 
NAT and, 306-308 
reliability, 586-587 
service model, 585-586 
STUN and, 320 

summary and references, 591-593 
transport protocols in TCP/IP suite, 15 
well-known ports for, 525-526 
TCP Westwood+ (TCPW+) algorithm, 777 
TCP Westwood (TCPW) algorithm, 779 
tcpdump command 

connecting to Web server on host, 171 
ICMP destination unreachable messages, 480 
not converting IP addresses to machine names, 
479 

viewing UDP fragmentation, 490-491 
TCP_NODELAY option, for disabling Nagle 
algorithm, 700 

tcptrace, connection statistics with, 745-747 
TCPW+ (TCP Westwood+) algorithm, 777 
TCPW (TCP Westwood) algorithm, 779 
TDM (time-division multiplexing), 4, 959 
Teardrop attacks 
ICMP, 428 
UDP, 506 

telnet command 

connecting to Web server on host, 171 
establishing TCP connections, 602 
Telnet program 

SSH replacing, 692 
well-known port for, 18 

Temporal Key Integrity Protocol (TKIP), in Wi-Fi, 
129-130 

Temporary addresses, in DHCPv6, 255-256 
Tentative state, IPv6 addresses, 253 
Teredo, tunneling IPv6 over IPv4 
IPv4/IPv6 translation, 339 
relays and servers, 482 
tunneling, 154, 482-487 
Termination messages 
in LCP operation, 134 
in MRD, 394-395 

Termination, of TCP connections, 595-598 
TFC (Traffic Flow Confidentiality), 858 
TFN (Tribe Flood Network), 429 
TFRC (TCP Friendly Rate Control), 768-770, 959 
TFTP (Trivial File Transfer Protocol) 
definition of, 959 

ICMP port unreachable messages and, 366-370 
Threats, to network communication, 807-809 
Three-way handshake, 597, 640 
Throughput (802.11n), Wi-Fi, 126-128 
Time-division multiplexing (TDM), 4, 959 
Time exceeded message, ICMP, 375-378 
Time-Remaining messages, in LCP operation, 134 
Time sync function. See TSF (time sync function) 
Time-to-live. See TTL (Time-to-live) 

TIME-WAIT Assassination (TWA), 630-631, 960 
Timed wait (MSL), 618 

Timeouts, TCP. See TCP timeout/retransmission 
Timer-based retransmission 
example of, 665-667 
introduction to, 647 
overview of, 664-665 

Timestamp Echo Reply. See TSER (Timestamp Echo 
Reply) 

Timestamp Request/Reply message, ICMP attacks 
and, 429 

Timestamp Value. See TSV (Timestamp Value) 
Timestamps option. See also TSOPT (timestamps 
option) 

neighbor discovery in IPv6, 416 
TCP header, 608-610 
TIME_WAIT state (2MSL), in TCP 
overview of, 618-624 
TCP state transitions, 624 
TWA (TIME-WAIT Assassination), 630-631 
Tinygrams, 696 

TKIP (Temporal Key Integrity Protocol), in Wi-Fi, 
129-130 

TLDs (top-level domains) 
definition of, 959 
in DNS name space, 512 
name servers for, 517 
TLS (Transport Layer Security) 
with datagrams (DTLS), 884-891 
definition of, 959 
DTLS DoS protection, 894 
DTLS handshake protocol, 892-894 
DTLS record layer, 891-892 
example of use of, 884-891 


extensions, 883-884 
handshaking protocols, 880-883 
HTTP/HTTPS and, 18 
overview of, 876-877 
Record protocol, 878-880 
renegotiation of cryptographic connection 
parameters, 884 
TCP with, 320 
TLS 1.2, 877-878 
TLV (type-length-value) sets 
definition of, 959 
IPv6 options held as, 196-197 
Top-level domains. See TLDs (top-level domains) 
Topology Change Acknowledgment (TCA), 106 
Topology change notification (TCN), 107 
Topology Change (TC), in BPDU structure, 106 
Topology changes, STP handling, 107 
ToS (Type of Service) byte 
definition of, 959 

ICMP Parameter Problem and, 379 
in IPv4, 183, 188-189 
redefined as DSCP/ECN fields, 379 
Total Length field 

ICMP Parameter Problem and, 379 
in IP header, 183-184 
TPDU (transport PDU), 10 
traceroute, for determining routing path, 
376-378 

Traffic analysis, types of threats to network 
communication, 808 

Traffic Class byte, in IPv6, 183, 188-189 
Traffic Flow Confidentiality (TFC), 858 
Traffic selectors (TS) 
definition of, 960 
IKE, 851, 873 

Traffic specification (TSPEC), in Wi-Fi QoS 
definition of, 960 
in HCCA, 123 

Traffic visibility, ESP (Encapsulating Security 
Payload), 863-864 

Transaction authentication, in DNS, 911-915 
Transaction Signatures (TSIG), in DNS 
definition of, 960 

transaction authentication in DNSSEC, 911-914 
Transient session keys (TSKs), 838 
Translating 

DNS from IPv4 to IPv6, 568-569 
ICMPv4 to ICMPv6, 424-426 
ICMPv6 to ICMPv4, 426-428 
IPv4 to IPv6, 482 

UDP/IPv4 and UDP/IPv6 datagrams, 505-506 
Translation behavior, NAT, 312 
Translation functions, NAT, 305 
Translators, TCP connections, 605 
Transmission Control Protocol. See TCP 
(Transmission Control Protocol) 

Transmit opportunities (TXOPs), in Wi-Fi QoS 
in DCF, 123 
definition of, 960 
Transparency, DNS, 567-568 
layering violation, 476 
of OSI model, 9-10 

security. See TLS (Transport Layer Security) 
transport protocols in TCP/IP suite, 15-16 
UDP checksum, 475-476 

Transport Layer Security. See TLS (Transport Layer 
Security) 

Transport PDU (TPDU), 10 
Transport protocols, 309 
Traversal, NAT, 316 

Traversal Using Relays around NAT. See TURN 
(Traversal Using Relays around NAT) 

Tribe Flood Network (TFN), 429 
Triple-DES (3DES) 
definition of, 959 

standardized for Internet use, 819 
as symmetric encryption algorithm, 811 
Trivial File Transfer Protocol (TFTP) 
definition of, 959 

ICMP port unreachable messages and, 366-370 
Trunking, VLAN switches and, 90 
Trust anchors 

CAs (certification authorities) and, 822 
in ND, 417 

SEND (Secure Neighbor Discovery), 403 
TS (traffic selectors), in IPsec 
definition of, 960 
IKE, 851, 873 

TSER (Timestamp Echo Reply), in TCP 
definition of, 960 

Eifel Detection Algorithm and, 679 
TCP Timestamp option and, 609 
timer-based retransmission and, 665-666 
TSF (time sync function), in Wi-Fi 
in 802.11 specification, 119-120 
definition of, 960 
Wi-Fi frames and, 114 
TSIG (Transaction Signatures), in DNS 
definition of, 960 

transaction authentication in DNS, 911-914 
TSKs (transient session keys), 838 
TSOPT (timestamps option), in TCP 
definition of, 960 


Eifel Detection Algorithm using, 679 
Linux RTT estimation and, 657 
robustness of RTTM to loss and reordering, 
662-664 

RTTM (RTT Measurement) with, 656-657 
TCP header, 608-610 

TSPEC (traffic specification), in Wi-Fi QoS 
definition of, 960 
in HCCA, 123 

TSV (Timestamp Value), in TCP 
definition of, 960 

Eifel Detection Algorithm and, 679 
RTTM with Timestamps option, 656 
TCP Timestamp option and, 608-609 
TTL (Time-to-live) 
definition of, 960 

ICMP Time Exceeded message, 375, 378 
IP header fields, 184 

MRD (Multicast Router Discovery) and, 394 
name servers, 517 
QS (Quick-Start) TTL, 199 
SYN segments, 611 
Tunnel endpoint, IPv6 traffic and, 46 
Tunneled packets, NAT and, 310 
Tunneling 

IPv4/IPv6 translation, 339 
IPv6 options for, 198 
link layer and, 149-153 
link layer attacks and, 156 
Tunneling proxy servers, 302 
TURN (Traversal Using Relays around NAT) 
definition of, 960 
ICE making use of, 332-334 
overview of, 326-332 
Teredo relays compared with, 482 
TWA (TIME-WAIT Assassination), 630-631, 960 
TXOPs (transmit opportunities), in Wi-Fi QoS 
in DCF, 123 
definition of, 960 
TXT records 

definition of, 960 

DNS resource record types, 545-547 
Type field, in Ethernet frame format, 85-86 
Type-length-value (TLV) sets 
definition of, 959 
IPv6 options held as, 196-197 
Type of Service byte. See ToS (Type of Service) byte 

U 

U-NAPTR (URI-enabled NAPTR) 
definition of, 960 
DNS resource record types, 555 
U-NII (Unlicensed National Information 
Infrastructure) 

5GHz band for, 124 
definition of, 960 

UBM (unicast-prefix-based multicast) 
allocation of IPv4 addresses, 56 
definition of, 960 
UDLs (unidirectional links) 
definition of, 960 
link layer and, 153-154 
UDP-Lite, 487-488 
UDP servers 

designing, 498-499 

flow control and congestion control in server 
design, 505 

foreign IP address restrictions in server design, 
502-503 

IP addresses and port numbers in server design, 
499-500 

local IP address restrictions in server design, 
500-501 

multiple addresses in server design, 501-502 
multiple servers per port, 503-504 
spanning IP address families in server design, 
504 

UDP (User Datagram Protocol) 
attacks related to, 507-508 
broadcast overhead and, 451 
checksum, 475-478 
connection refused error, 626 
as connectionless protocols, 595 
definition of, 960 
examples, 478-481 

flow control and congestion control in server 
design, 505 

foreign IP address restrictions in server design, 
502-503 
header, 474-475 
ICE and, 332 

interaction between IP fragmentation and ARP/ 
ND, 496-497 
in the Internet, 506-507 
introduction to, 473-474 

IP addresses and port numbers in server design, 
499-500 

IP fragmentation and, 488-492 
IPv6 and, 481-482 

local IP address restrictions in server design, 
500-501 

maximum UDP datagram size, 497-498 
multiple addresses in server design, 501-502 
multiple servers per port, 503-504 
NAT and, 308-309 


PMTUD (Path MTU Discovery) with, 493-496 
reassembly timeout, 492 
sending broadcast datagrams, 439 
server design, 498-499 

spanning IP address families in server design, 504 
STUN and, 320 

summary and references, 508-510 
Teredo tunneling and, 482-487 
translating UDP/IPv4 and UDP/IPv6 datagrams, 
505-506 

transport protocols in TCP/IP suite, 15 
UDP-Lite, 487-488 
well-known ports for, 525-526 
UEQM (unequal modulation), in 802.11n 
definition of, 960 

higher throughput (802.11n) support and, 127 
ULAs (Unique Local IPv6 Unicast Addresses) 
definition of, 960 
NAT and, 310 
overview of, 225 
Unauthorized access attacks, 26 
Unequal modulation (UEQM), in 802.11n 
definition of, 960 

higher throughput (802.11n) support and, 127 
Unicast addresses 
allocation of, 62-65 
anycast addresses, 62 
assigning, 65-66 
C class spaces for, 35 
definition of, 34 

Echo Request message sent from link-local 
unicast address, 445-446 

IIDs as basis for unicast IPv6 addresses, 43-46 
multiple providers/multiple networks/multiple 
addresses, 68-70 
overview of, 15 

single provider/multiple networks/multiple 
addresses, 67-68 

single provider/no network/single address, 
66-67 

single provider/single network/single address, 67 

Unicast-prefix-based IPv6 multicast addresses, 58 
Unicast-prefix-based multicast (UBM) 
allocation of IPv4 addresses, 56 
definition of, 960 

Unicode, internationalization of Internet, 512 
Unidirectional links (UDLs) 
definition of, 960 
link layer and, 153-154 
Uniform Resource Locator (URL), 961 
Unilateral self-address fixing. See UNSAF (unilateral 
self-address fixing) 
Unique Local IPv6 Unicast Addresses. See ULAs 
(Unique Local IPv6 Unicast Addresses) 
Universal Plug and Play (UPnP) framework 
definition of, 961 

direct interaction with NAT and firewalls, 
337-339 

Universal Resource Identifier. See URI (Universal 
Resource Identifier) 

UNIX 

Berkeley version. See BSD (Berkeley Software 
Distribution) 
rlogin, 692 

Unlicensed National Information Infrastructure 
(U-NII) 

5GHz band for, 124 
definition of, 960 

Unreachable hosts, keepalives detecting, 795-796 
UNSAF (unilateral self-address fixing) 
definition of, 961 
overview of, 317-319 

STUN (Session Traversal Utilities for NAT), 319-326 
Updates 

DNS Update, 567 
dynamic DNS updates, 555-558 
UPnP (Universal Plug and Play) framework 
definition of, 961 

direct interaction with NAT and firewalls, 337-339 
Upper layer, TLS (Transport Layer Security), 877 
UPs (user priorities), in Wi-Fi QoS, 123, 961 
URG (Urgent Mechanism), in TCP 
definition of, 961 

example working with urgent data, 720-722 
overview of, 719-720 
TCP header, 590 

URI-enabled NAPTR (U-NAPTR) 
definition of, 960 
DNS resource record types, 555 
URI (Universal Resource Identifier) 
definition of, 961 
ENUM records and, 551-552 
NAPTR records and, 549 
URI/URN resolution, 553-554 
URL (Uniform Resource Locator), 961 
URN resolution, 553-554 
Usage-specific keys (USRK), in EAP 
definition of, 961 
key derivation in EAP, 838 
User Datagram Protocol. See UDP (User Datagram 
Protocol) 

User priorities (UPs), in Wi-Fi QoS, 123, 961 
User Timeout (UTO) option, in TCP 
definition of, 961 
TCP header, 611-612 


USRK (usage-specific keys) 
definition of, 961 
key derivation in EAP, 838 
UTC (Coordinated Universal Time), 961 
UTO (User Timeout) option, in TCP 
definition of, 961 
TCP header, 611-612 
vconfig command, for manipulating 802.1p/q 
VCs (virtual circuits) 
definition of, 961 
multiplexing and, 4 
Vegas TCP algorithm, 777-778 
Vendor Extension field, DHCP/BOOTP message 
format, 238, 246 
VENONA system, 918 
Virtual carrier sense, 121 
Virtual circuits (VCs) 
definition of, 961 
multiplexing and, 4 

Virtual LANs. See VLANs (virtual LANs) 

Virtual private networks. See VPNs (virtual private 
networks) 

Viruses, 806 

VJ (Van Jacobson) compression, 141-142 
VLAN identifier, 90 
VLAN tag, 90 
VLANs (virtual LANs) 